Chapter 1
INTRODUCTION 



This AT&T UNIX* PC UNIX System V Programmer's Guide 
describes: 



• C Language, the main programming language available on 
the UNIX system 

• the shell Language available on the UNIX system 

• support tools, various software tools that aid the UNIX 
operating system user. 

C Language, a medium-level programming language, was used 
to write most of the UNIX operating system. Chapter 2 
describes the C language. Chapters 3 through 7 describe the 
libraries and support tools available with the UNIX system for 
the benefit of the C language programmer. These chapters 
contain the following: 

C LANGUAGE— Chapter 2 provides a summary of the 
grammar and rules of the C programming language. 
Chapter 2 describes the C language as it is implemented
and supported on the UNIX PC, the PDP-11 computer, and
the VAX-11/780 computer. Where differences exist, this
chapter tries to point out implementation-dependent details.
With few exceptions, such dependencies follow directly
from the properties of the hardware. The various compilers
are generally quite compatible.

LIBRARIES— Chapters 3 and 4 describe functions and 
declarations that support the C Language and how to use 
these functions. Chapter 3 describes the C Library and 
Chapter 4 describes the Object File and Math Libraries. 

THE "cc" COMMAND— Chapter 5 describes the 
command used to compile C language programs, produce 
assembly language programs, and produce executable 
programs. 

A C PROGRAM CHECKER "lint"- Chapter 6 
describes a program that attempts to detect compile-time 
bugs and non-portable features in C programs. 

A SYMBOLIC DEBUGGER "sdb"- Chapter 7 
describes a symbolic debugging program that is used to 
debug compiled C language programs. 

Chapter 8 contains a reference manual for the UNIX System 
Assembler for the UNIX PC. 

Chapter 9 describes the curses package that provides a 
programmer with screen-oriented programming capabilities. 

Chapters 10 through 12 provide information on how to use the 
shell Language. 

USING SHELL COMMANDS- Chapter 10 builds on 
the UNIX System User Guide or the "hands-on" experience 
some have acquired. It is intended for those users who 
have some basic familiarity with shell but desire more 
detailed information. 

SHELL PROGRAMMING- Chapter 11 provides 
information for programming with shell. Users who
intend to do shell programming should read Chapter 11 as
well as Chapter 12.

EXAMPLES OF SHELL PROCEDURES- Chapter 12 
contains examples of shell programs. 

It is important to note a few things about shell. The shell 
functions as a: 

• Command language— The shell reads command lines 
entered at a terminal and interprets the lines as requests 
to execute other programs. 

• Programming language— The shell is a programming 
language just like BASIC, COBOL, FORTRAN, and other 
languages. The shell is a high-level programming language 
that is easy to learn. The programs written using the shell 
programming language are called shell scripts, procedures, 
or commands. These programs are stored in files and 
executed just like commands. The shell provides variables, 
conditional constructs, and iterative constructs. 

• Working environment— The shell also provides an 
environment that can be tailored to an individual's or 
group's needs by manipulating environment variables. 

Support tools provide an added dimension to the basic UNIX 
software commands. The tools described in the following 
chapters enable users to fully use the capabilities of the UNIX 
operating system. 

A PROGRAM FOR MAINTAINING COMPUTER 
PROGRAMS "make"— Chapter 13 describes a software 
tool for maintaining, updating, and regenerating groups of 
computer programs. The many activities of program 
development and maintenance are made simpler by the 
make program. 

SOURCE CODE CONTROL SYSTEM (SCCS)
USER'S GUIDE- Chapter 14 describes the collection of 
SCCS programs provided under the UNIX operating 
system. The SCCS programs act as a "custodian" over the 
UNIX system files. 

"m4" MACRO PROCESSOR— Chapter 15 describes a 
general purpose macro processor that may be used as a 
front end for rational Fortran, C, and other programming 
languages. 

"awk" PROGRAMMING LANGUAGE- Chapter 16 
describes a software tool designed to make many common 
information retrieval and text manipulation tasks easy to 
state and to perform. 

LINK EDITOR- Chapter 17 describes a software tool 
(ld) that creates load files by combining object files,
performing relocation, and resolving internal references. 

COMMON OBJECT FILE FORMAT "coff"- Chapter
18 describes the output file produced on some UNIX 
systems by the assembler and the link editor. 

ARBITRARY PRECISION DESK CALCULATOR 
LANGUAGE "bc"- Chapter 19 describes a compiler for
doing arbitrary precision arithmetic on the UNIX operating 
system. 

INTERACTIVE DESK CALCULATOR "dc"- 

Chapter 20 describes a program implemented on the UNIX 
operating system to do arbitrary-precision integer 
arithmetic. 

LEXICAL ANALYZER GENERATOR "lex"- 

Chapter 21 describes a software tool that lexically 
processes character input streams. 

YET ANOTHER COMPILER-COMPILER "yacc"- 

Chapter 22 describes the yacc program. The yacc
program provides a general tool for imposing structure on
the input to a computer program.

UNIX SYSTEM TO UNIX SYSTEM COPY "uucp"- 

Chapter 23 describes a network that provides information 
exchange (between UNIX systems) over the direct distance 
dialing network. 

Some examples in this guide are based on the Document 
Preparation software which is available independently for the 
UNIX system. Make sure that the system has Document 
Preparation software available before trying any of those 
examples. 

Throughout this document, each reference of the form
name(N), where N is a number possibly followed by a letter,
refers to entry name in section N of the AT&T UNIX PC UNIX
System V Manual.

Normally when the system is ready for a command from a 
terminal, a prompt is displayed on the terminal (# by default). 
With certain commands, the system expects more than one line 
of terminal input. When this is the case, a secondary prompt is 
displayed (> by default). To avoid confusion with what the 
system displays and what the user types, this document does 
not show prompts displayed by the system unless noted 
otherwise. 

Chapter 2
C LANGUAGE 

LEXICAL CONVENTIONS 

There are six classes of tokens— identifiers, keywords, 
constants, strings, operators, and other separators. Blanks, 
tabs, new-lines, and comments (collectively, "white space") as 
described below are ignored except as they serve to separate 
tokens. Some white space is required to separate otherwise 
adjacent identifiers, keywords, and constants. 

If the input stream has been parsed into tokens up to a given 
character, the next token is taken to include the longest string 
of characters which could possibly constitute a token. 
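
The longest-token rule can be seen in a small sketch. It is
written for a present-day ANSI/ISO C compiler, which is an
assumption; the names longest_token_demo and a_after are
illustrative only and do not come from this guide.

```c
/* Longest-token rule: "a+++b" is read as the tokens  a  ++  +  b,
 * that is (a++) + b, never a + (++b). */
int longest_token_demo(int a, int b, int *a_after)
{
    int r = a+++b;   /* same as (a++) + b */
    *a_after = a;    /* a was incremented; b was not */
    return r;
}
```

Writing the expression with white space, as in a++ + b, gives
the same tokens and is easier to read.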

Comments 

The characters /* introduce a comment which terminates with 
the characters */. Comments do not nest. 

Identifiers (Names) 

An identifier is a sequence of letters and digits. The first 
character must be a letter. The underscore (_) counts as a 
letter. Uppercase and lowercase letters are different. Although 
there is no limit on the length of a name, only initial characters 
are significant: at least eight characters of a non-external 
name, and perhaps fewer for external names. Moreover, some 
implementations may collapse case distinctions for external 
names. The external name sizes include: 

PDP-11          7 characters, 2 cases
VAX-11          >100 characters, 2 cases
AT&T 3B20       >100 characters, 2 cases
AT&T UNIX PC    >100 characters, 2 cases



Keywords 

The following identifiers are reserved for use as keywords and 
may not be used otherwise: 



auto        do          goto        short       typedef
break       double      if          signed      union
case        else        int         sizeof      unsigned
char        enum        long        static      void
const       extern      register    struct      volatile
continue    float       return      switch      while
default     for

This implementation reserves the word asm. 



Constants 

There are several kinds of constants. Each has a type; an 
introduction to types is given in "NAMES." Hardware 
characteristics that affect sizes are summarized in "Hardware 
Characteristics" under "LEXICAL CONVENTIONS." 



Integer Constants 

An integer constant consisting of a sequence of digits is taken
to be octal if it begins with 0 (digit zero). An octal constant
consists of the digits 0 through 7 only. A sequence of digits
preceded by 0x or 0X (digit zero) is taken to be a hexadecimal
integer. The hexadecimal digits include a or A through f or F
with values 10 through 15. Otherwise, the integer constant is
taken to be decimal. A decimal constant whose value exceeds
the largest signed machine integer is taken to be long; an octal
or hex constant which exceeds the largest unsigned machine
integer is likewise taken to be long. Otherwise, integer
constants are int.



Explicit Long Constants 

A decimal, octal, or hexadecimal integer constant immediately 
followed by l (letter ell) or L is a long constant. As discussed
below, on some machines integer and long values may be 
considered identical. 
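
As a minimal sketch of the radix and suffix rules, assuming a
present-day ANSI/ISO C compiler, the functions below return the
same digits read as decimal, octal, and hexadecimal constants.
The function names are illustrative only.

```c
/* The digits "17" read in three radixes. */
int decimal_17(void) { return 17; }     /* decimal: seventeen */
int octal_17(void)   { return 017; }    /* octal: 1*8 + 7 = 15 */
int hex_17(void)     { return 0x17; }   /* hexadecimal: 1*16 + 7 = 23 */
int hex_letters(void) { return 0xaF; }  /* a = 10, F = 15: 10*16 + 15 = 175 */

/* An L suffix makes the constant long whatever its value. */
int suffix_gives_long(void) { return sizeof(17L) == sizeof(long); }
```

A leading zero added for alignment, as in 017, silently changes
the value; this is a common source of errors.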



Character Constants 

A character constant is a character enclosed in single quotes, as 
in 'x'. The value of a character constant is the numerical value 
of the character in the machine's character set. 



Certain nongraphic characters, the single quote (') and the 
backslash (\), may be represented according to the following 
table of escape sequences: 



escape             ESC        \e
new-line           NL (LF)    \n
horizontal tab     HT         \t
vertical tab       VT         \v
backspace          BS         \b
carriage return    CR         \r
form feed          FF         \f
backslash          \          \\
single quote       '          \'
bit pattern        ddd        \ddd
double quote       "          \"



The escape \ddd consists of the backslash followed by 1, 2, or 3 
octal digits which are taken to specify the value of the desired 
character. A special case of this construction is \0 (not 
followed by a digit), which indicates the character NUL. If the 
character following a backslash is not one of those specified, 
the behavior is undefined. A new-line character is illegal in a 
character constant. The type of a character constant is int.
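
The following sketch illustrates character constants and the
octal escape. It assumes the ASCII character set and a
present-day ANSI/ISO C compiler; the function names are
illustrative only.

```c
/* Character constants have type int; the codes assume ASCII. */
int newline_code(void) { return '\n'; }    /* 10 in ASCII */
int octal_escape(void) { return '\101'; }  /* octal 101 = 65, 'A' in ASCII */
int nul_code(void)     { return '\0'; }    /* the character NUL, value 0 */

/* Returns 1 because a character constant occupies an int. */
int constant_is_int(void) { return sizeof('x') == sizeof(int); }
```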



Floating Constants 

A floating constant consists of an integer part, a decimal point, 
a fraction part, an e or E, and an optionally signed integer 
exponent. The integer and fraction parts both consist of a 
sequence of digits. Either the integer part or the fraction part 
(not both) may be missing. Either the decimal point or the e 
and the exponent (not both) may be missing. 



Enumeration Constants 

Names declared as enumerators (see "Structure, Union, and 
Enumeration Declarations" under "DECLARATIONS") have 
type int. 



Strings 

A string is a sequence of characters surrounded by double 
quotes, as in "...". A string has type "array of char" and 
storage class static (see "NAMES") and is initialized with the 
given characters. The compiler places a null byte (\0) at the 
end of each string so that programs which scan the string can 
find its end. In a string, the double quote character (") must
be preceded by a \; in addition, the same escapes as described 
for character constants may be used. 

A \ and the immediately following new-line are ignored. All 
strings, even when written identically, are distinct. 
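
The terminating null byte can be observed by comparing the
storage a string occupies with the length that a scanning
routine reports. This is a sketch assuming a present-day
ANSI/ISO C compiler and the standard strlen routine; the
function names are illustrative only.

```c
#include <stddef.h>
#include <string.h>

/* Storage includes the null byte: 'a' 'b' 'c' '\0'. */
size_t stored_bytes(void)  { return sizeof "abc"; }

/* A scan stops at the null byte and does not count it. */
size_t visible_chars(void) { return strlen("abc"); }

/* Each \" escape contributes a single character. */
size_t with_escape(void)   { return strlen("say \"hi\""); }
```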



Hardware Characteristics 

The following figures summarize certain hardware properties 
that vary from machine to machine. 

                DEC PDP-11
                (ASCII)
char            8 bits
int             16
short           16
long            32
float           32
double          64
float range     ±10^±38
double range    ±10^±38



Figure 2-1. DEC PDP-11 HARDWARE 
CHARACTERISTICS 



                DEC VAX-11
                (ASCII)
char            8 bits
int             32
short           16
long            32
float           32
double          64
float range     ±10^±38
double range    ±10^±38



Figure 2-2. DEC VAX-11 HARDWARE 
CHARACTERISTICS 

                AT&T UNIX PC
                AT&T 3B
                (ASCII)
char            8 bits
int             32
short           16
long            32
float           32
double          64
float range     ±10^±38
double range    ±10^±308



Figure 2-3. AT&T UNIX PC/3B HARDWARE 
CHARACTERISTICS 



SYNTAX NOTATION 

Syntactic categories are indicated by italic type and literal 
words and characters in bold type. Alternative categories are 
listed on separate lines. An optional terminal or nonterminal 
symbol is indicated by the subscript "opt," so that 



{ expression_opt }

indicates an optional expression enclosed in braces. The syntax 
is summarized in "SYNTAX SUMMARY". 

NAMES

The C language bases the interpretation of an identifier upon 
two attributes of the identifier— its storage class and its type. 
The storage class determines the location and lifetime of the 
storage associated with an identifier; the type determines the 
meaning of the values found in the identifier's storage. 



Storage Class 

There are four declarable storage classes: 

• Automatic 

• Static 

• External 

• Register. 

Automatic variables are local to each invocation of a block (see 
"Compound Statement or Block" in "STATEMENTS") and are 
discarded upon exit from the block. Static variables are local to 
a block but retain their values upon reentry to a block even 
after control has left the block. External variables exist and 
retain their values throughout the execution of the entire 
program and may be used for communication between 
functions, even separately compiled functions. Register 
variables are (if possible) stored in the fast registers of the 
machine; like automatic variables, they are local to each block 
and disappear on exit from the block. 
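
The difference between static and automatic storage can be
sketched as follows, assuming a present-day ANSI/ISO C
compiler. The function names are illustrative only.

```c
/* count keeps its value between calls, but is visible only here. */
int next_ticket(void)
{
    static int count = 0;   /* initialized once, before the first call */
    count = count + 1;
    return count;
}

/* count is created afresh on every call and discarded on return. */
int fresh_local(void)
{
    int count = 0;          /* automatic is the default inside a block */
    count = count + 1;
    return count;
}
```

Successive calls to next_ticket return 1, 2, 3, and so on,
while every call to fresh_local returns 1.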



Type 

The C language supports several fundamental types of objects. 
Objects declared as characters (char) are large enough to store 
any member of the implementation's character set. If a 
genuine character from that character set is stored in a char 
variable, its value is equivalent to the integer code for that 
character. Other quantities may be stored into character 
variables, but the implementation is machine dependent. In 
particular, char may be signed or unsigned by default. 

Up to three sizes of integer, declared short int, int, and long
int, are available. Longer integers provide no less storage than
shorter ones, but the implementation may make either short 
integers or long integers, or both, equivalent to plain integers. 
"Plain" integers have the natural size suggested by the host 
machine architecture. The other sizes are provided to meet 
special needs. 
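
A program can check the size ordering on the machine at hand
with sizeof. This is a minimal sketch assuming a present-day
ANSI/ISO C compiler; the function names are illustrative only.

```c
/* Returns 1 when short, int, and long are in nondecreasing size. */
int sizes_are_ordered(void)
{
    return sizeof(short) <= sizeof(int) && sizeof(int) <= sizeof(long);
}

/* Size of a plain int in bytes on this machine. */
int plain_int_bytes(void)
{
    return (int) sizeof(int);
}
```

On the machines in the hardware figures of this chapter,
plain_int_bytes would be expected to return 2 on the PDP-11 and
4 on the others, given 8-bit bytes.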

The properties of enum types (see "Structure, Union, and 
Enumeration Declarations" under "DECLARATIONS") are 
identical to those of some integer types. The implementation 
may use the range of values to determine how to allot storage. 



Unsigned integers, declared unsigned, obey the laws of 
arithmetic modulo 2^n where n is the number of bits in the
representation. (On the PDP-11, unsigned long quantities are 
not supported.) 
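
Modular arithmetic on unsigned integers can be sketched as
follows. The sketch assumes a present-day ANSI/ISO C compiler
with the UINT_MAX macro from limits.h; the function names are
illustrative only.

```c
#include <limits.h>

/* UINT_MAX is 2^n - 1, so adding 1 gives 2^n, which is congruent to 0. */
unsigned wrap_up(void)   { return UINT_MAX + 1u; }

/* 0 - 1 is congruent to 2^n - 1, the largest unsigned value. */
unsigned wrap_down(void) { return 0u - 1u; }
```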



Single-precision floating point (float) and double precision 
floating point (double) may be synonymous in some 
implementations. 

Because objects of the foregoing types can usefully be 
interpreted as numbers, they will be referred to as arithmetic 
types. Char, int of all sizes whether unsigned or not, and 
enum will collectively be called integral types. The float and 
double types will collectively be called floating types. 

The void type specifies an empty set of values. It is used as 
the type returned by functions that generate no value. 

Besides the fundamental arithmetic types, there is a 
conceptually infinite class of derived types constructed from the 
fundamental types in the following ways: 



• Arrays of objects of most types 

• Functions which return objects of a given type 

• Pointers to objects of a given type 

• Structures containing a sequence of objects of various 
types 

• Unions capable of containing any one of several objects 
of various types. 

In general these methods of constructing objects can be applied 
recursively. 



OBJECTS AND LVALUES 

An object is a manipulatable region of storage. An lvalue is an 
expression referring to an object. An obvious example of an 
lvalue expression is an identifier. There are operators which 
yield lvalues: for example, if E is an expression of pointer type, 
then *E is an lvalue expression referring to the object to which 
E points. The name "lvalue" comes from the assignment 
expression E1 = E2 in which the left operand E1 must be an
lvalue expression. The discussion of each operator below 
indicates whether it expects lvalue operands and whether it 
yields an lvalue. 



CONVERSIONS 

A number of operators may, depending on their operands, cause 
conversion of the value of an operand from one type to another. 
This part explains the result to be expected from such 
conversions. The conversions demanded by most ordinary 
operators are summarized under "Arithmetic Conversions."
The summary will be supplemented as required by the 
discussion of each operator. 

Characters and Integers

A character or a short integer may be used wherever an integer 
may be used. In all cases the value is converted to an integer. 
Conversion of a shorter integer to a longer preserves sign. 
Whether or not sign-extension occurs for characters is machine 
dependent, but it is guaranteed that a member of the standard 
character set is non-negative. Of the machines treated here, 
only the PDP-11, VAX-11, and UNIX PC sign-extend. On these 
machines, char variables range in value from -128 to 127. The 
more explicit type unsigned char forces the values to range
from 0 to 255.



On machines that treat characters as signed, the characters of 
the ASCII set are all non-negative. However, a character 
constant specified with an octal escape suffers sign extension 
and may appear negative; for example, '\377' has the value -1.

When a longer integer is converted to a shorter integer or to a 
char, it is truncated on the left. Excess bits are simply 
discarded. 
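
Widening and truncation can be sketched as follows. The sketch
assumes 8-bit characters and a present-day ANSI/ISO C compiler;
the function names are illustrative only. The explicit signed
and unsigned qualifiers are used so that the result does not
depend on whether plain char sign-extends.

```c
/* Truncation on the left: only the low-order 8 bits survive. */
int low_byte(long wide)  { return (unsigned char) wide; }

/* Widening a signed quantity preserves its sign. */
int widen_signed(void)   { signed char c = -5; return c; }

/* Widening an unsigned char never yields a negative value. */
int widen_unsigned(void) { unsigned char c = 0377; return c; }
```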



Float and Double 

All floating arithmetic in C is carried out in double precision. 
Whenever a float appears in an expression it is lengthened to 
double by zero padding its fraction. When a double must be 
converted to float, for example by an assignment, the double 
is rounded before truncation to float length. This result is 
undefined if it cannot be represented as a float. 



Floating and Integral 

Conversions of floating values to integral type are rather 
machine dependent. In particular, the direction of truncation 
of negative numbers varies. The result is undefined if it will 
not fit in the space provided. Positive and negative floating 
point values are truncated to their integer portions. 

     1.1  ->   1
     1.9  ->   1
    -1.1  ->  -1
    -1.9  ->  -1

Conversions of integral values to floating type are well behaved. 
Some loss of accuracy occurs if the destination lacks sufficient 
bits. 
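
The truncation table above can be reproduced with a cast. This
sketch assumes a present-day ANSI/ISO C compiler, where
conversion to int discards the fraction; the function name is
illustrative only.

```c
/* Convert a floating value to int by discarding its fraction. */
int to_int(double d)
{
    return (int) d;
}
```

Values too large for an int should be checked before the
conversion, because the result is then undefined.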



Pointers and Integers 

An expression of integral type may be added to or subtracted 
from a pointer; in such a case, the first is converted as specified 
in the discussion of the addition operator. Two pointers to 
objects of the same type may be subtracted; in this case, the 
result is converted to an integer as specified in the discussion of 
the subtraction operator. 



Unsigned 

Whenever an unsigned integer and a plain integer are 
combined, the plain integer is converted to unsigned and the 
result is unsigned. The value is the least unsigned integer 
congruent to the signed integer (modulo 2^wordsize). In a 2's
complement representation, this conversion is conceptual; and 
there is no actual change in the bit pattern. 

When an unsigned short integer is converted to long, the 
value of the result is the same numerically as that of the 
unsigned integer. Thus the conversion amounts to padding with 
zeros on the left. 
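
One consequence of the unsigned conversion is that a negative
integer compared with an unsigned one behaves as a very large
number. This is a sketch assuming a present-day ANSI/ISO C
compiler with limits.h; the function names are illustrative
only.

```c
#include <limits.h>

/* The least unsigned integer congruent to i. */
unsigned as_unsigned(int i) { return i; }

/* i is converted to unsigned before the comparison, so -1 becomes
 * the largest unsigned value and the test yields 0. */
int minus_one_below_one(void)
{
    int i = -1;
    unsigned u = 1;
    return i < u;
}
```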

Arithmetic Conversions

A great many operators cause conversions and yield result 
types in a similar way. This pattern will be called the "usual 
arithmetic conversions." 

1. First, any operands of type char or short are converted 
to int, and any operands of type unsigned char or 
unsigned short are converted to unsigned int. 

2. Then, if either operand is double, the other is converted 
to double and that is the type of the result. 

3. Otherwise, if either operand is float, the other is 
converted to float and that is the type of the result. 

4. Otherwise, if either operand is unsigned long, the other 
is converted to unsigned long and that is the type of 
the result. 

5. Otherwise, if either operand is long, the other is 
converted to long and that is the type of the result. 

6. Otherwise, if one operand is long, and the other is 
unsigned int, they are both converted to unsigned 
long and that is the type of the result. 

7. Otherwise, if either operand is unsigned, the other is 
converted to unsigned and that is the type of the result. 

8. Otherwise, both operands must be int, and that is the 
type of the result. 
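
Two of the rules above, the widening of char operands to int
and the conversion of an int operand to double, can be sketched
as follows. The sketch assumes a present-day ANSI/ISO C
compiler; the function names are illustrative only.

```c
#include <stddef.h>

/* Both char operands are converted to int before the addition,
 * so the sum has the size of an int. */
size_t char_sum_size(void)
{
    char a = 1, b = 2;
    return sizeof(a + b);
}

/* One operand is double, so n is converted to double. */
double mixed_divide(int n) { return n / 2.0; }

/* Both operands are int, so the result is int. */
int int_divide(int n)      { return n / 2; }
```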



Void 

The (nonexistent) value of a void object may not be used in 
any way, and neither explicit nor implicit conversion may be 
applied. Because a void expression denotes a nonexistent value, 
such an expression may be used only as an expression 
statement (see "Expression Statement" under 
"STATEMENTS") or as the left operand of a comma expression
(see "Comma Operator" under "EXPRESSIONS"). 

An expression may be converted to type void by use of a cast. 
For example, this makes explicit the discarding of the value of 
a function call used as an expression statement. 



EXPRESSIONS 

The precedence of expression operators is the same as the order 
of the major subsections of this section, highest precedence 
first. Thus, for example, the expressions referred to as the 
operands of + (see "Additive Operators") are those expressions 
defined under "Primary Expressions", "Unary Operators", and 
"Multiplicative Operators". Within each subpart, the operators 
have the same precedence. Left- or right-associativity is 
specified in each subsection for the operators discussed therein. 
The precedence and associativity of all the expression operators 
are summarized in the grammar of "SYNTAX SUMMARY". 

Otherwise, the order of evaluation of expressions is undefined. 
In particular, the compiler considers itself free to compute 
subexpressions in the order it believes most efficient even if the 
subexpressions involve side effects. The order in which 
subexpression evaluation takes place is unspecified. 
Expressions involving a commutative and associative operator 
(*, +, &, |, ^) may be rearranged arbitrarily even in the
presence of parentheses; to force a particular order of 
evaluation, an explicit temporary must be used. 
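
The use of explicit temporaries to force an order of evaluation
can be sketched as follows, assuming a present-day ANSI/ISO C
compiler. All names (order, calls, first, second, ordered_sum,
first_ran_first) are illustrative only.

```c
static int order[2];   /* records which function ran first */
static int calls;

static int first(void)  { order[calls++] = 1; return 10; }
static int second(void) { order[calls++] = 2; return 20; }

/* Writing first() + second() would leave the order of the two
 * calls to the compiler. Separate statements fix the order. */
int ordered_sum(void)
{
    int a, b;

    calls = 0;
    a = first();
    b = second();
    return a + b;
}

/* Returns 1 if first() ran before second() in the last call. */
int first_ran_first(void)
{
    return order[0] == 1 && order[1] == 2;
}
```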

The handling of overflow and divide check in expression 
evaluation is undefined. Most existing implementations of C 
ignore integer overflows; treatment of division by 0 and all
floating-point exceptions varies between machines and is 
usually adjustable by a library function. 

Primary Expressions



Primary expressions involving ., ->, subscripting, and function
calls group left to right. 



primary-expression:
        identifier
        constant
        string
        ( expression )
        primary-expression [ expression ]
        primary-expression ( expression-list_opt )
        primary-expression . identifier
        primary-expression -> identifier

expression-list:
        expression
        expression-list , expression

An identifier is a primary expression provided it has been 
suitably declared as discussed below. Its type is specified by its 
declaration. If the type of the identifier is "array of ...", then
the value of the identifier expression is a pointer to the first
object in the array; and the type of the expression is "pointer to
...". Moreover, an array identifier is not an lvalue expression.
Likewise, an identifier which is declared "function returning
...", when used except in the function-name position of a call, is
converted to "pointer to function returning ...".

A constant is a primary expression. Its type may be int, long, 
or double depending on its form. Character constants have 
type int and floating constants have type double. 

A string is a primary expression. Its type is originally "array 
of char", but following the same rule given above for 
identifiers, this is modified to "pointer to char" and the result 
is a pointer to the first character in the string. (There is an 
exception in certain initializers; see "Initialization" under 
"DECLARATIONS.")

A parenthesized expression is a primary expression whose type 
and value are identical to those of the unadorned expression. 
The presence of parentheses does not affect whether the 
expression is an lvalue. 

A primary expression followed by an expression in square 
brackets is a primary expression. The intuitive meaning is that 
of a subscript. Usually, the primary expression has type 
"pointer to . . .", the subscript expression is int, and the type of 
the result is "...". The expression E1[E2] is identical (by 
definition) to *((E1)+(E2)). All the clues needed to 
understand this notation are contained in this subpart together 
with the discussions in "Unary Operators" and "Additive 
Operators" on identifiers, * and +, respectively. The 
implications are summarized under "Arrays, Pointers, and 
Subscripting" under "TYPES REVISITED." 
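
The identity between E1[E2] and *((E1)+(E2)) can be sketched as
follows, assuming a present-day ANSI/ISO C compiler. The
function names are illustrative only.

```c
/* The usual subscript form. */
int by_subscript(int *a, int i) { return a[i]; }

/* The form the subscript is defined to mean. */
int by_pointer(int *a, int i)   { return *(a + i); }

/* Addition commutes, so i[a] means *((i)+(a)), the same object.
 * This form is legal but is shown only to illustrate the rule. */
int reversed(int *a, int i)     { return i[a]; }
```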

A function call is a primary expression followed by parentheses 
containing a possibly empty, comma-separated list of 
expressions which constitute the actual arguments to the 
function. The primary expression must be of type "function 
returning ...," and the result of the function call is of type 
"... ". As indicated below, a hitherto unseen identifier followed 
immediately by a left parenthesis is contextually declared to 
represent a function returning an integer; thus in the most 
common case, integer-valued functions need not be declared. 

Any actual arguments of type float are converted to double 
before the call. Any of type char or short are converted to 
int. Array names are converted to pointers. No other 
conversions are performed automatically; in particular, the 
compiler does not compare the types of actual arguments with 
those of formal arguments. If conversion is needed, use a cast; 
see "Unary Operators" and "Type Names" under 
"DECLARATIONS." 




In preparing for the call to a function, a copy is made of each 
actual parameter. Thus, all argument passing in C is strictly 
by value. A function may change the values of its formal 
parameters, but these changes cannot affect the values of the 
actual parameters. It is possible to pass a pointer on the 
understanding that the function may change the value of the 
object to which the pointer points. An array name is a pointer 
expression. The order of evaluation of arguments is undefined 
by the language; take note that the various compilers differ. 
Recursive calls to any function are permitted. 
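
Argument passing by value, and the use of a pointer to let a
function change the caller's object, can be sketched as
follows. The sketch assumes a present-day ANSI/ISO C compiler;
the function names are illustrative only.

```c
/* n is a copy of the actual parameter; the caller's variable
 * is not affected by this assignment. */
void set_copy(int n)
{
    n = 99;
}

/* p is also a copy, but it points to the caller's object, so
 * the assignment through it is visible to the caller. */
void set_target(int *p)
{
    *p = 99;
}
```

After x is set to 1, the call set_copy(x) leaves x equal to 1,
while set_target(&x) changes x to 99.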

A primary expression followed by a dot followed by an 
identifier is an expression. The first expression must be a 
structure or a union, and the identifier must name a member of 
the structure or union. The value is the named member of the 
structure or union, and it is an lvalue if the first expression is 
an lvalue. 

A primary expression followed by an arrow (built from - and
>) followed by an identifier is an expression. The first
expression must be a pointer to a structure or a union and the
identifier must name a member of that structure or union. The
result is an lvalue referring to the named member of the
structure or union to which the pointer expression points. Thus
the expression E1->MOS is the same as (*E1).MOS.
Structures and unions are discussed in "Structure, Union, and 
Enumeration Declarations" under "DECLARATIONS." 
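
The equivalence of the arrow and dot forms can be sketched as
follows, assuming a present-day ANSI/ISO C compiler. The
structure point and the function names are illustrative only.

```c
struct point {
    int x;
    int y;
};

/* Member selection through a pointer, in both spellings. */
int x_by_arrow(struct point *p) { return p->x; }
int x_by_star(struct point *p)  { return (*p).x; }

/* p->x is an lvalue, so it may appear on the left of =. */
void move_right(struct point *p) { p->x = p->x + 1; }
```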



Unary Operators 

Expressions with unary operators group right to left. 

unary-expression:
        * expression
        & lvalue
        - expression
        ! expression
        ~ expression
        ++ lvalue
        -- lvalue
        lvalue ++
        lvalue --
        ( type-name ) expression
        sizeof expression
        sizeof ( type-name )

The unary * operator means indirection; the expression must
be a pointer, and the result is an lvalue referring to the object 
to which the expression points. If the type of the expression is 
"pointer to . . .," the type of the result is "... ". 

The result of the unary & operator is a pointer to the object 
referred to by the lvalue. If the type of the lvalue is "... ", the 
type of the result is "pointer to . . .". 

The result of the unary - operator is the negative of its
operand. The usual arithmetic conversions are performed. The
negative of an unsigned quantity is computed by subtracting its
value from 2^n where n is the number of bits in the
corresponding signed type.

There is no unary + operator. 

The result of the logical negation operator ! is one if the value 
of its operand is zero, zero if the value of its operand is 
nonzero. The type of the result is int. It is applicable to any 
arithmetic type or to pointers. 

The ~ operator yields the one's complement of its operand. The
usual arithmetic conversions are performed. The type of the
operand must be integral.

The object referred to by the lvalue operand of prefix ++ is
incremented. The value is the new value of the operand but is
not an lvalue. The expression ++x is equivalent to x=x+1.
See the discussions "Additive Operators" and "Assignment
Operators" for information on conversions.

The lvalue operand of prefix -- is decremented in a similar
manner: the expression --x is equivalent to x=x-1.

When postfix ++ is applied to an lvalue, the result is the value 
of the object referred to by the lvalue. After the result is 
noted, the object is incremented in the same manner as for the 
prefix ++ operator. The type of the result is the same as the 
type of the lvalue expression. 

When postfix -- is applied to an lvalue, the result is the value
of the object referred to by the lvalue. After the result is
noted, the object is decremented in the same manner as for the
prefix -- operator. The type of the result is the same as the
type of the lvalue expression.
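
The difference between the prefix and postfix forms lies in the
value that the expression yields. This is a sketch assuming a
present-day ANSI/ISO C compiler; the function names are
illustrative only.

```c
/* Prefix: the object is incremented, then its new value is used. */
int prefix_value(int x)  { return ++x; }

/* Postfix: the old value is noted, then the object is incremented. */
int postfix_value(int x) { return x++; }

/* Either decrement form leaves the object one smaller. */
int after_decrement(int x)
{
    x--;
    return x;
}
```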

An expression preceded by the parenthesized name of a data 
type causes conversion of the value of the expression to the 
named type. This construction is called a cast. Type names are 
described in "Type Names" under "DECLARATIONS."

The sizeof operator yields the size in bytes of its operand. (A 
byte is the space required to hold a char.) When applied to an 
array, the result is the total number of bytes in the array. The 
size is determined from the declarations of the objects in the 
expression. This expression is semantically an unsigned 
constant and may be used anywhere a constant is required. Its 
major use is in communication with routines like storage
allocators and I/O systems.

The sizeof operator may also be applied to a parenthesized
type name. In that case it yields the size in bytes of an object
of the indicated type.

The construction sizeof(type) is taken to be a unit, so the
expression sizeof(type)-2 is the same as (sizeof(type))-2.
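
A brief sketch of both uses of sizeof, assuming a present-day ISO C
compiler, where the result has the unsigned type size_t from
<stddef.h>. The function names are hypothetical.

```c
#include <stddef.h>

/* Number of elements in an array: total bytes divided by bytes per element. */
size_t element_count(void)
{
    int table[10];
    return sizeof table / sizeof table[0];
}

/* sizeof(long) is a unit, so this is (sizeof(long)) - 2,
   not sizeof applied to the cast expression (long)-2. */
size_t long_size_less_two(void)
{
    return sizeof(long) - 2;
}
```

Dividing the size of an array by the size of one element is a common
way to obtain the element count without repeating the bound.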



Multiplicative Operators 

The multiplicative operators *, /, and % group left to right. 
The usual arithmetic conversions are performed. 



multiplicative expression: 

expression * expression 
expression / expression 
expression % expression 



The binary * operator indicates multiplication. The * operator 
is associative, and expressions with several multiplications at 
the same level may be rearranged by the compiler. The binary 
/ operator indicates division. 

The binary % operator yields the remainder from the division 
of the first expression by the second. The operands must be 
integral. 

When positive integers are divided, truncation is toward 0; but 
the form of truncation is machine-dependent if either operand 
is negative. On all machines covered by this manual, the 
remainder has the same sign as the dividend. It is always true 
that (a/b)*b + a%b is equal to a (if b is not 0). 
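
The identity above can be checked with a short sketch. It assumes a
compiler that truncates division toward 0 for negative operands, as
ISO C99 and later require; the function names are hypothetical.

```c
/* Returns 1 if (a/b)*b + a%b reproduces a. The caller must ensure b != 0. */
int division_identity_holds(int a, int b)
{
    return (a / b) * b + a % b == a;
}

/* Returns 1 if the remainder is 0 or has the same sign as the dividend a. */
int remainder_follows_dividend(int a, int b)
{
    int r = a % b;
    return r == 0 || (r < 0) == (a < 0);
}
```
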

Additive Operators

The additive operators + and - group left to right. The usual
arithmetic conversions are performed. There are some 
additional type possibilities for each operator. 

additive-expression: 

expression + expression 
expression - expression 

The result of the + operator is the sum of the operands. A 
pointer to an object in an array and a value of any integral 
type may be added. The latter is in all cases converted to an 
address offset by multiplying it by the length of the object to 
which the pointer points. The result is a pointer of the same 
type as the original pointer which points to another object in 
the same array, appropriately offset from the original object. 
Thus if P is a pointer to an object in an array, the expression 
P+1 is a pointer to the next object in the array. No further 
type combinations are allowed for pointers. 

The + operator is associative, and expressions with several 
additions at the same level may be rearranged by the compiler. 

The result of the - operator is the difference of the operands.
The usual arithmetic conversions are performed. Additionally, 
a value of any integral type may be subtracted from a pointer, 
and then the same conversions for addition apply. 

If two pointers to objects of the same type are subtracted, the 
result is converted (by division by the length of the object) to 
an int representing the number of objects separating the 
pointed-to objects. This conversion will in general give 
unexpected results unless the pointers point to objects in the 
same array, since pointers, even to objects of the same type, do 
not necessarily differ by a multiple of the object length. 
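
Pointer addition and subtraction within one array might look like the
sketch below. It assumes present-day ISO C, where the difference of
two pointers has type ptrdiff_t from <stddef.h> rather than int. The
function names are hypothetical.

```c
#include <stddef.h>

/* Adding 3 to a pointer moves it three elements, not three bytes. */
ptrdiff_t elements_between(void)
{
    double a[8];
    double *p = &a[2];
    double *q = p + 3;        /* now points to a[5] */
    return q - p;             /* counted in elements */
}

/* The same two positions viewed as char pointers differ by bytes. */
ptrdiff_t bytes_between(void)
{
    double a[8];
    char *p = (char *)&a[2];
    char *q = (char *)&a[5];
    return q - p;
}
```
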

Shift Operators

The shift operators << and >> group left to right. Both
perform the usual arithmetic conversions on their operands, 
each of which must be integral. Then the right operand is 
converted to int; the type of the result is that of the left 
operand. The result is undefined if the right operand is 
negative or greater than or equal to the length of the object in 
bits. 

shift-expression: 

expression << expression
expression >> expression

The value of E1<<E2 is E1 (interpreted as a bit pattern) left-shifted
E2 bits. Vacated bits are 0-filled. The value of E1>>E2 is E1
right-shifted E2 bit positions. The right shift is guaranteed to be
logical (0 fill) if E1 is unsigned; otherwise, it may be arithmetic.
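
A minimal sketch of both shifts on unsigned operands, where the
behavior is fully defined. The function names are hypothetical, and
the caller is assumed to keep the count nonnegative and smaller than
the number of bits in an unsigned.

```c
/* Left shift: vacated low-order bits are filled with 0. */
unsigned shift_left(unsigned value, int count)
{
    return value << count;
}

/* Right shift of an unsigned value is logical: high-order bits are 0-filled. */
unsigned shift_right(unsigned value, int count)
{
    return value >> count;
}
```
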



Relational Operators 

The relational operators group left to right. 

relational-expression: 

expression < expression 
expression > expression 
expression <= expression 
expression >= expression 

The operators < (less than), > (greater than), <= (less than or
equal to), and >= (greater than or equal to) all yield 0 if the
specified relation is false and 1 if it is true. The type of the
result is int. The usual arithmetic conversions are performed. 
Two pointers may be compared; the result depends on the 
relative locations in the address space of the pointed-to objects. 
Pointer comparison is portable only when the pointers point to 
objects in the same array.

Equality Operators

equality-expression: 

expression == expression 
expression != expression

The == (equal to) and the != (not equal to) operators are 
exactly analogous to the relational operators except for their 
lower precedence. (Thus a<b == c<d is 1 whenever a<b and 
c<d have the same truth value). 

A pointer may be compared to an integer only if the integer is 
the constant 0. A pointer to which 0 has been assigned is
guaranteed not to point to any object and will appear to be 
equal to 0. In conventional usage, such a pointer is considered 
to be null. 



Bitwise AND Operator 

and-expression: 

expression & expression 

The & operator is associative, and expressions involving & may 
be rearranged. The usual arithmetic conversions are 
performed. The result is the bitwise AND function of the 
operands. The operator applies only to integral operands. 



Bitwise Exclusive OR Operator 

exclusive-or-expression: 

expression ^ expression

The ^ operator is associative, and expressions involving ^ may
be rearranged. The usual arithmetic conversions are
performed; the result is the bitwise exclusive OR function of the
operands. The operator applies only to integral operands.



Bitwise Inclusive OR Operator 

inclusive-or-expression: 

expression | expression

The | operator is associative, and expressions involving | may
be rearranged. The usual arithmetic conversions are 
performed; the result is the bitwise inclusive OR function of its 
operands. The operator applies only to integral operands. 

Logical AND Operator 

logical-and-expression: 

expression && expression 

The && operator groups left to right. It returns 1 if both its
operands evaluate to nonzero, 0 otherwise. Unlike &, &&
guarantees left to right evaluation; moreover, the second 
operand is not evaluated if the first operand is 0. 

The operands need not have the same type, but each must have 
one of the fundamental types or be a pointer. The result is 
always int. 



Logical OR Operator 

logical-or-expression: 

expression || expression

The || operator groups left to right. It returns 1 if either of
its operands evaluates to nonzero, 0 otherwise. Unlike |, ||
guarantees left to right evaluation; moreover, the second
operand is not evaluated if the value of the first operand is
nonzero.

The operands need not have the same type, but each must have 
one of the fundamental types or be a pointer. The result is 
always int. 
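
The guarantee that the second operand may be skipped is what makes
guards such as the one below safe. This is a sketch in present-day
ISO C; the names divides_evenly, note, and operands_evaluated_by_or
are hypothetical.

```c
/* Safe even when d is 0: && skips n % d if d != 0 is false. */
int divides_evenly(int n, int d)
{
    return d != 0 && n % d == 0;
}

static int evaluations;   /* how many operands have been evaluated */

/* Records that an operand was evaluated and passes its value through. */
static int note(int value)
{
    evaluations++;
    return value;
}

/* Returns how many operands of note(a) || note(b) were evaluated. */
int operands_evaluated_by_or(int a, int b)
{
    evaluations = 0;
    (void)(note(a) || note(b));
    return evaluations;
}
```

When the first operand of || is nonzero, only one operand is
evaluated; when it is 0, both are.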



Conditional Operator 

conditional-expression: 

expression ? expression : expression 

Conditional expressions group right to left. The first 
expression is evaluated; and if it is nonzero, the result is the
value of the second expression, otherwise that of the third
expression. If possible, the usual arithmetic conversions are
performed to bring the second and third expressions to a
common type. If both are structures or unions of the same
type, the result has the type of the structure or union. If both
are pointers of the same type, the result has the common type.
Otherwise, one must be a pointer and the other the constant 0, 
and the result has the type of the pointer. Only one of the 
second and third expressions is evaluated. 
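
A short sketch of the conditional operator, including its right to
left grouping and the conversion to a common type. It assumes
present-day ISO C, and the function names are hypothetical.

```c
#include <stddef.h>

/* Yields a or b; only the chosen operand is evaluated. */
int max_of(int a, int b)
{
    return a > b ? a : b;
}

/* Groups right to left: x > 0 ? 1 : (x < 0 ? -1 : 0). */
int sign_of(int x)
{
    return x > 0 ? 1 : x < 0 ? -1 : 0;
}

/* An int operand and a double operand are brought to the common type double. */
size_t mixed_result_size(void)
{
    return sizeof(1 ? 1 : 2.0);
}
```
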



Assignment Operators 

There are a number of assignment operators, all of which group 
right to left. All require an lvalue as their left operand, and 
the type of an assignment expression is that of its left operand. 
The value is the value stored in the left operand after the 
assignment has taken place. The two parts of a compound 
assignment operator are separate tokens.

assignment-expression:
lvalue = expression
lvalue += expression
lvalue -= expression
lvalue *= expression
lvalue /= expression
lvalue %= expression
lvalue >>= expression
lvalue <<= expression
lvalue &= expression
lvalue ^= expression
lvalue |= expression

In the simple assignment with =, the value of the expression 
replaces that of the object referred to by the lvalue. If both 
operands have arithmetic type, the right operand is converted 
to the type of the left preparatory to the assignment. Second, 
both operands may be structures or unions of the same type. 
Finally, if the left operand is a pointer, the right operand must 
in general be a pointer of the same type. However, the 
constant 0 may be assigned to a pointer; it is guaranteed that
this value will produce a null pointer distinguishable from a 
pointer to any object. 

The behavior of an expression of the form E1 op= E2 may be
inferred by taking it as equivalent to E1 = E1 op (E2);
however, E1 is evaluated only once. In += and -=, the left
operand may be a pointer; in which case, the (integral) right
operand is converted as explained in "Additive Operators." All
right operands and all nonpointer left operands must have
arithmetic type.
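
That the left operand of a compound assignment is evaluated only once
matters when it has a side effect. The sketch below, in present-day
ISO C, counts the evaluations; the structure outcome and the function
names are hypothetical.

```c
struct outcome {
    int calls;    /* times the subscript expression was evaluated */
    int first;    /* v[0] afterward */
    int second;   /* v[1] afterward */
};

static int index_calls;

/* Returns 0, 1, 2, ... on successive calls. */
static int next_index(void)
{
    return index_calls++;
}

/* Applies += to v[next_index()] and reports what happened. */
struct outcome compound_assign_demo(void)
{
    struct outcome o;
    int v[2];

    v[0] = 1;
    v[1] = 2;
    index_calls = 0;
    v[next_index()] += 10;    /* left operand evaluated once: only v[0] changes */

    o.calls = index_calls;
    o.first = v[0];
    o.second = v[1];
    return o;
}
```

Written out as v[next_index()] = v[next_index()] + 10, the same
statement would call next_index twice and refer to two different
elements.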



Comma Operator 



comma-expression: 

expression , expression

A pair of expressions separated by a comma is evaluated left to
right, and the value of the left expression is discarded. The 
type and value of the result are the type and value of the right 
operand. This operator groups left to right. In contexts where 
comma is given a special meaning, e.g., in lists of actual 
arguments to functions (see "Primary Expressions") and lists of 
initializers (see "Initialization" under "DECLARATIONS"), the 
comma operator as described in this subpart can only appear in 
parentheses. For example, 

f(a, (t=3, t+2), c) 

has three arguments, the second of which has the value 5. 
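
The call above can be made concrete with a small sketch in
present-day ISO C. The function second_of_three stands in for f and
is hypothetical, as are the argument values 1 and 4.

```c
/* Returns its second argument; the other two are ignored. */
static int second_of_three(int a, int b, int c)
{
    (void)a;
    (void)c;
    return b;
}

/* Passes three arguments; the parenthesized comma expression is the second. */
int comma_demo(void)
{
    int t;
    return second_of_three(1, (t = 3, t + 2), 4);
}
```
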



DECLARATIONS 

Declarations are used to specify the interpretation which C 
gives to each identifier; they do not necessarily reserve storage 
associated with the identifier. Declarations have the form 



declaration:

decl-specifiers declarator-list_opt ;

The declarators in the declarator-list contain the identifiers 
being declared. The decl-specifiers consist of a sequence of type 
and storage class specifiers. 

decl-specifiers: 

type-specifier decl-specifiers_opt
sc-specifier decl-specifiers_opt

The list must be self-consistent in a way described below.

Storage Class Specifiers

The sc-specifiers are: 

sc-specifier: 
auto 
static 
extern 
register 
typedef 

The typedef specifier does not reserve storage and is called a 
"storage class specifier" only for syntactic convenience. See 
"Typedef" for more information. The meanings of the various 
storage classes were discussed in "Names." 

The auto, static, and register declarations also serve as 
definitions in that they cause an appropriate amount of storage 
to be reserved. In the extern case, there must be an external 
definition (see "External Definitions") for the given identifiers 
somewhere outside the function in which they are declared. 

A register declaration is best thought of as an auto 
declaration, together with a hint to the compiler that the 
variables declared will be heavily used. Only the first few such 
declarations in each function are effective. Moreover, only 
variables of certain types will be stored in registers; on the 
PDP-11, they are int or pointer. One other restriction applies 
to register variables: the address-of operator & cannot be 
applied to them. Smaller, faster programs can be expected if 
register declarations are used appropriately, but future 
improvements in code generation may render them 
unnecessary. 

At most, one sc-specifier may be given in a declaration. If the 
sc-specifier is missing from a declaration, it is taken to be auto 
inside a function, extern outside. Exception: functions are 
never automatic.

Type Specifiers

The type-specifiers are 

type-specifier:
basic-type-specifier
struct-or-union-specifier
typedef-name
enum-specifier

basic-type-specifier:
basic-type
basic-type basic-type-specifiers

basic-type:
char
short
int
long
unsigned
float
double
void

At most one of the words long or short may be specified in 
conjunction with int; the meaning is the same as if int were 
not mentioned. The word long may be specified in conjunction 
with float; the meaning is the same as double. The word 
unsigned may be specified alone, or in conjunction with int or 
any of its short or long varieties, or with char. 

Otherwise, at most one type-specifier may be given in a 
declaration. In particular, adjectival use of long, short, or 
unsigned is not permitted with typedef names. If the type- 
specifier is missing from a declaration, it is taken to be int. 

Specifiers for structures, unions, and enumerations are 
discussed in "Structure, Union, and Enumeration Declarations." 
Declarations with typedef names are discussed in "Typedef."

Declarators

The declarator-list appearing in a declaration is a comma- 
separated sequence of declarators, each of which may have an 
initializer. 



declarator-list: 

init-declarator 
init-declarator , declarator-list 

init-declarator:

declarator initializer_opt

Initializers are discussed in "Initialization". The specifiers in 
the declaration indicate the type and storage class of the 
objects to which the declarators refer. Declarators have the 
syntax: 

declarator: 

identifier 

( declarator ) 

* declarator 

declarator () 

declarator [ constant-expression ] 

The grouping is the same as in expressions. 



Meaning of Declarators 

Each declarator is taken to be an assertion that when a 
construction of the same form as the declarator appears in an 
expression, it yields an object of the indicated type and storage 
class. 

Each declarator contains exactly one identifier; it is this 
identifier that is declared. If an unadorned identifier appears 
as a declarator, then it has the type indicated by the specifier
heading the declaration.

A declarator in parentheses is identical to the unadorned 
declarator, but the binding of complex declarators may be 
altered by parentheses. See the examples below. 

Now imagine a declaration

T D1

where T is a type-specifier (like int, etc.) and D1 is a
declarator. Suppose this declaration makes the identifier have
type "... T ," where the "..." is empty if D1 is just a plain
identifier (so that the type of x in "int x" is just int). Then if
D1 has the form

*D

the type of the contained identifier is "... pointer to T ."

If D1 has the form

D()

then the contained identifier has the type "... function
returning T."

If D1 has the form

D[constant-expression]

or

D[]



then the contained identifier has type "... array of T." In the 
first case, the constant expression is an expression whose value 
is determinable at compile time, whose type is int, and whose
value is positive. (Constant expressions are defined precisely in 
"Constant Expressions.") When several "array of" 
specifications are adjacent, a multidimensional array is created; 
the constant expressions which specify the bounds of the arrays 
may be missing only for the first member of the sequence. This 
elision is useful when the array is external and the actual 
definition, which allocates storage, is given elsewhere. The first 
constant expression may also be omitted when the declarator is 
followed by initialization. In this case the size is calculated 
from the number of initial elements supplied. 

An array may be constructed from one of the basic types, from 
a pointer, from a structure or union, or from another array (to 
generate a multidimensional array). 

Not all the possibilities allowed by the syntax above are 
actually permitted. The restrictions are as follows: functions 
may not return arrays or functions although they may return 
pointers; there are no arrays of functions although there may 
be arrays of pointers to functions. Likewise, a structure or 
union may not contain a function; but it may contain a pointer 
to a function. 

As an example, the declaration 

int i, *ip, f(), *fip(), (*pfi)(); 

declares an integer i, a pointer ip to an integer, a function f 
returning an integer, a function fip returning a pointer to an 
integer, and a pointer pfi to a function which returns an 
integer. It is especially useful to compare the last two. The 
binding of *fip() is *(fip()). The declaration suggests, and the
same construction in an expression requires, the calling of a
function fip, and then using indirection through the (pointer)
result to yield an integer. In the declarator (*pfi)(), the extra
parentheses are necessary, as they are also in an expression, to 
indicate that indirection through a pointer to a function yields 
a function, which is then called; it returns an integer. 

As another example, 

float fa[17], *afp[17]; 

declares an array of float numbers and an array of pointers to 
float numbers. Finally, 

static int x3d[3][5][7]; 

declares a static 3-dimensional array of integers, with rank
3 x 5 x 7. In complete detail, x3d is an array of three items; each
item is an array of five arrays; each of the latter arrays is an 
array of seven integers. Any of the expressions x3d, x3d[i], 
x3d[i][j], x3d[i][j][k] may reasonably appear in an expression. 
The first three have type "array" and the last has type int. 
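
The declarations f, fip, pfi, and x3d from the examples above can be
exercised as in the following sketch. It is written in present-day
ISO C, so the empty parentheses become (void); the function bodies,
the variable seven, and the helper names are hypothetical.

```c
#include <stddef.h>

static int seven = 7;

/* f: function returning int. */
int f(void) { return 42; }

/* fip: function returning pointer to int. */
int *fip(void) { return &seven; }

/* pfi: pointer to function returning int. */
int (*pfi)(void) = f;

/* *fip() calls fip, then indirects; (*pfi)() indirects, then calls. */
int call_both(void)
{
    return *fip() + (*pfi)();
}

static int x3d[3][5][7];

/* x3d is an array of three items. */
size_t x3d_items(void) { return sizeof x3d / sizeof x3d[0]; }

/* Each item is an array of five arrays. */
size_t x3d_rows_per_item(void) { return sizeof x3d[0] / sizeof x3d[0][0]; }

/* Each of those is an array of seven integers. */
size_t x3d_ints_per_row(void) { return sizeof x3d[0][0] / sizeof x3d[0][0][0]; }
```
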



Structure and Union Declarations 

A structure is an object consisting of a sequence of named 
members. Each member may have any type. A union is an 
object which may, at a given time, contain any one of several 
members. Structure and union specifiers have the same form. 



struct-or-union-specifier: 

struct-or-union { struct-decl-list } 
struct-or-union identifier { struct-decl-list } 
struct-or-union identifier

struct-or-union:
struct 
union 

The struct-decl-list is a sequence of declarations for the 
members of the structure or union: 

struct-decl-list: 

struct-declaration 
struct-declaration struct-decl-list 

struct-declaration: 

type-specifier struct-declarator-list ; 

struct-declarator-list: 
struct-declarator 
struct-declarator , struct-declarator-list 

In the usual case, a struct-declarator is just a declarator for a 
member of a structure or union. A structure member may also 
consist of a specified number of bits. Such a member is also 
called a field; its length, a non-negative constant expression, is
set off from the field name by a colon. 

struct-declarator:
declarator
declarator : constant-expression
: constant-expression

Within a structure, the objects declared have addresses which 
increase as the declarations are read left to right. Each 
nonfield member of a structure begins on an addressing 
boundary appropriate to its type; therefore, there may be 
unnamed holes in a structure. Field members are packed into 
machine integers; they do not straddle words. A field which 
does not fit into the space remaining in a word is put into the 
next word. No field may be wider than a word. Fields are
assigned right to left on the PDP-11 and VAX-11, left to right
on the 3B20.

A struct-declarator with no declarator, only a colon and a 
width, indicates an unnamed field useful for padding to 
conform to externally-imposed layouts. As a special case, a
field with a width of 0 specifies alignment of the next field at
an implementation-dependent boundary.

The language does not restrict the types of things that are 
declared as fields, but implementations are not required to 
support any but integer fields. Moreover, even int fields may 
be considered to be unsigned. On the UNIX PC and PDP-11, 
fields are not signed and have only integer values; on the 
VAX-11, fields declared with int are treated as containing a 
sign. For these reasons, it is strongly recommended that fields 
be declared as unsigned. In all implementations, there are no 
arrays of fields, and the address-of operator & may not be 
applied to them, so that there are no pointers to fields. 
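
Following the recommendation to declare fields as unsigned, a
structure with fields might be sketched as below. The structure
flags, its member names, and the function stored_mode are
hypothetical; the layout of the fields in storage remains
implementation-dependent.

```c
struct flags {
    unsigned ready : 1;   /* one-bit field */
    unsigned mode  : 3;   /* three-bit field: holds 0 through 7 */
    unsigned       : 0;   /* width 0: align the next field at a boundary */
    unsigned count : 4;
};

/* Stores value in the three-bit field and returns what the field holds. */
unsigned stored_mode(unsigned value)
{
    struct flags f;
    f.ready = 1;
    f.mode = value;       /* only the low-order three bits are kept */
    f.count = 0;
    return f.mode;
}
```

A value too wide for an unsigned field is reduced to the bits that
fit, so storing 9 in the three-bit field leaves 1.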

A union may be thought of as a structure all of whose members 
begin at offset 0 and whose size is sufficient to contain any of
its members. At most, one of the members can be stored in a 
union at any time. 

A structure or union specifier of the second form, that is, one of 

struct identifier { struct-decl-list } 
union identifier { struct-decl-list } 

declares the identifier to be the structure tag (or union tag) of 
the structure specified by the list. A subsequent declaration 
may then use the third form of specifier, one of 

struct identifier 
union identifier

Structure tags allow definition of self-referential structures.
Structure tags also permit the long part of the declaration to be 
given once and used several times. It is illegal to declare a 
structure or union which contains an instance of itself, but a 
structure or union may contain a pointer to an instance of 
itself. 

The third form of a structure or union specifier may be used 
prior to a declaration which gives the complete specification of 
the structure or union in situations in which the size of the 
structure or union is unnecessary. The size is unnecessary in 
two situations: when a pointer to a structure or union is being 
declared and when a typedef name is declared to be a 
synonym for a structure or union. This, for example, allows the 
declaration of a pair of structures which contain pointers to 
each other. 

The names of members and tags do not conflict with each other 
or with ordinary variables. A particular name may not be used 
twice in the same structure, but the same name may be used in 
several different structures in the same scope. 

A simple but important example of a structure declaration is 
the following binary tree structure: 

struct tnode
{
        char tword[20];
        int count;
        struct tnode *left;
        struct tnode *right;
};

which contains an array of 20 characters, an integer, and two 
pointers to similar structures. Once this declaration has been 
given, the declaration

struct tnode s, *sp;

declares s to be a structure of the given sort and sp to be a 
pointer to a structure of the given sort. With these 
declarations, the expression 

sp->count

refers to the count field of the structure to which sp points;

s.left

refers to the left subtree pointer of the structure s; and

s.right->tword[0]

refers to the first character of the tword member of the right 
subtree of s. 
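
The member references above can be tried with a sketch such as the
following, in present-day ISO C. The helper tnode_init, the two test
functions, and the words stored in the nodes are hypothetical.

```c
#include <string.h>

struct tnode {
    char tword[20];
    int count;
    struct tnode *left;
    struct tnode *right;
};

/* Fills in a node with a word (truncated to fit) and no subtrees. */
void tnode_init(struct tnode *sp, const char *word)
{
    strncpy(sp->tword, word, sizeof sp->tword - 1);
    sp->tword[sizeof sp->tword - 1] = '\0';
    sp->count = 1;
    sp->left = 0;
    sp->right = 0;
}

/* Builds a two-node tree and returns s.right->tword[0]. */
char first_char_of_right(void)
{
    struct tnode s, r;
    tnode_init(&s, "middle");
    tnode_init(&r, "zebra");
    s.right = &r;
    return s.right->tword[0];
}

/* Reads the count member through a pointer, as in sp->count. */
int count_through_pointer(void)
{
    struct tnode s, *sp = &s;
    tnode_init(sp, "word");
    return sp->count;
}
```
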

Enumeration Declarations 

Enumeration variables and constants have integral type. 

enum-specifier: 

enum { enum-list } 

enum identifier { enum-list } 

enum identifier 

enum-list: 

enumerator 
enum-list , enumerator 

enumerator: 
identifier 
identifier = constant-expression

The identifiers in an enum-list are declared as constants and
may appear wherever constants are required. If no
enumerators with = appear, then the values of the
corresponding constants begin at 0 and increase by 1 as the
declaration is read from left to right. An enumerator with = 
gives the associated identifier the value indicated; subsequent 
identifiers continue the progression from the assigned value. 

The names of enumerators in the same scope must all be 
distinct from each other and from those of ordinary variables. 

The role of the identifier in the enum-specifier is entirely 
analogous to that of the structure tag in a struct-specifier; it 
names a particular enumeration. For example, 

enum color { green, burgundy, claret=20, winedark }; 

enum color *cp, col; 

col = claret; 
cp = &col; 

if (*cp == burgundy) ... 

makes color the enumeration-tag of a type describing various 
colors, and then declares cp as a pointer to an object of that 
type, and col as an object of that type. The possible values are 
drawn from the set {0,1,20,21}. 
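
The values assigned to the enumerators of color can be confirmed with
a short sketch. The functions color_value and is_burgundy are
hypothetical.

```c
enum color { green, burgundy, claret = 20, winedark };

/* Returns the integer value behind an enumerator. */
int color_value(enum color c)
{
    return (int)c;
}

/* Mirrors the test in the text: compares through a pointer. */
int is_burgundy(enum color col)
{
    enum color *cp = &col;
    return *cp == burgundy;
}
```
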



Initialization 

A declarator may specify an initial value for the identifier 
being declared. The initializer is preceded by = and consists of 
an expression or a list of values nested in braces.

initializer:
= expression
= { initializer-list }
= { initializer-list , }

initializer-list:
expression
initializer-list , initializer-list
{ initializer-list }
{ initializer-list , }

All the expressions in an initializer for a static or external 
variable must be constant expressions, which are described in 
"CONSTANT EXPRESSIONS", or expressions which reduce to 
the address of a previously declared variable, possibly offset by 
a constant expression. Automatic or register variables may be 
initialized by arbitrary expressions involving constants and 
previously declared variables and functions. 

Static and external variables that are not initialized are 
guaranteed to start off as zero. Automatic and register 
variables that are not initialized are guaranteed to start off as 
garbage. 

When an initializer applies to a scalar (a pointer or an object of 
arithmetic type), it consists of a single expression, perhaps in 
braces. The initial value of the object is taken from the 
expression; the same conversions as for assignment are 
performed. 

When the declared variable is an aggregate (a structure or 
array), the initializer consists of a brace-enclosed, comma- 
separated list of initializers for the members of the aggregate 
written in increasing subscript or member order. If the 
aggregate contains subaggregates, this rule applies recursively 
to the members of the aggregate. If there are fewer initializers 
in the list than there are members of the aggregate, then the 
aggregate is padded with zeros. It is not permitted to initialize
unions or automatic aggregates.

Braces may in some cases be omitted. If the initializer begins 
with a left brace, then the succeeding comma-separated list of 
initializers initializes the members of the aggregate; it is 
erroneous for there to be more initializers than members. If, 
however, the initializer does not begin with a left brace, then 
only enough elements from the list are taken to account for the 
members of the aggregate; any remaining elements are left to
initialize the next member of the aggregate of which the 
current aggregate is a part. 

A final abbreviation allows a char array to be initialized by a 
string. In this case successive characters of the string initialize 
the members of the array. 

For example, 

int x[] = { 1, 3, 5 }; 

declares and initializes x as a one-dimensional array which has 
three members, since no size was specified and there are three 
initializers. 

float y[4][3] =
{
        { 1, 3, 5 },
        { 2, 4, 6 },
        { 3, 5, 7 },
};

is a completely-bracketed initialization: 1, 3, and 5 initialize the
first row of the array y[0], namely y[0][0], y[0][1], and y[0][2].
Likewise, the next two lines initialize y[1] and y[2]. The
initializer ends early and therefore y[3] is initialized with 0.
Precisely the same effect could have been achieved by

float y[4][3] =
{
        1, 3, 5, 2, 4, 6, 3, 5, 7
};

The initializer for y begins with a left brace but that for y[0] 
does not; therefore, three elements from the list are used. 
Likewise, the next three are taken successively for y[1] and
y[2]. Also, 

float y[4][3] =
{
        { 1 }, { 2 }, { 3 }, { 4 }
};

initializes the first column of y (regarded as a two-dimensional 
array) and leaves the rest 0. 

Finally, 

char msg[] = "Syntax error on line %s\n";

shows a character array whose members are initialized with a 
string. 
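
The effect of these initializers can be inspected with a sketch like
the one below. The arrays are declared static, since automatic
aggregates cannot be initialized in the language described here; the
name first_column and the accessor functions are hypothetical.

```c
#include <stddef.h>

static int x[] = { 1, 3, 5 };                 /* size taken from the list */

static float y[4][3] = {                      /* fourth row left as 0 */
    { 1, 3, 5 },
    { 2, 4, 6 },
    { 3, 5, 7 },
};

static float first_column[4][3] = {           /* one initializer per row */
    { 1 }, { 2 }, { 3 }, { 4 }
};

static char msg[] = "Syntax error on line %s\n";

/* Number of elements the compiler gave x. */
size_t x_length(void)          { return sizeof x / sizeof x[0]; }

/* Element accessors for the two float arrays. */
float  y_at(int r, int c)      { return y[r][c]; }
float  column_at(int r, int c) { return first_column[r][c]; }

/* Size of msg: the characters of the string plus the terminating null. */
size_t msg_size(void)          { return sizeof msg; }
```
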

Type Names 

In two contexts (to specify type conversions explicitly by means 
of a cast and as an argument of sizeof), it is desired to supply 
the name of a data type. This is accomplished using a "type 
name", which in essence is a declaration for an object of that 
type which omits the name of the object. 

type-name: 

type-specifier abstract-declarator

abstract-declarator:
empty 

( abstract-declarator ) 
* abstract-declarator 
abstract-declarator () 
abstract-declarator [ constant-expression ] 

To avoid ambiguity, in the construction 

( abstract-declarator ) 

the abstract-declarator is required to be nonempty. Under this 
restriction, it is possible to identify uniquely the location in the 
abstract-declarator where the identifier would appear if the 
construction were a declarator in a declaration. The named 
type is then the same as the type of the hypothetical identifier. 
For example, 

int
int *
int *[3]
int (*)[3]
int *()
int (*)()
int (*[3])()

name respectively the types "integer," "pointer to integer," 
"array of three pointers to integers," "pointer to an array of 
three integers," "function returning pointer to integer," 
"pointer to function returning an integer," and "array of three 
pointers to functions returning an integer."

Typedef

Declarations whose "storage class" is typedef do not define 
storage but instead define identifiers which can be used later as 
if they were type keywords naming fundamental or derived 
types. 

typedef-name: 
identifier 

Within the scope of a declaration involving typedef, each 
identifier appearing as part of any declarator therein becomes 
syntactically equivalent to the type keyword naming the type 
associated with the identifier in the way described in "Meaning 
of Declarators." For example, after 

typedef int MILES, *KLICKSP; 

typedef struct { double re, im; } complex; 

the constructions 

MILES distance; 

extern KLICKSP metricp; 

complex z, *zp; 

are all legal declarations; the type of distance is int, that of 
metricp is "pointer to int," and that of z is the specified 
structure. The zp is a pointer to such a structure. 

The typedef does not introduce brand-new types, only 
synonyms for types which could be specified in another way. 
Thus in the example above distance is considered to have 
exactly the same type as any other int object. 
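
That a typedef name is only a synonym can be seen in the sketch
below, which reuses MILES, KLICKSP, and complex from the example
above. It is written in present-day ISO C; the function names and the
values used are hypothetical.

```c
typedef int MILES, *KLICKSP;
typedef struct { double re, im; } complex;

/* MILES is only another name for int, so an int * may point at a MILES. */
int read_distance(void)
{
    MILES distance = 26;
    int *ip = &distance;          /* no cast needed: same type */
    KLICKSP metricp = ip;         /* KLICKSP is "pointer to int" */
    return *metricp;
}

/* Uses both a complex object and a pointer to one. */
double scaled_real_part(double re, double im, double k)
{
    complex z, *zp = &z;
    z.re = re;
    z.im = im;
    zp->re *= k;
    return z.re;
}
```
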

STATEMENTS

Except as indicated, statements are executed in sequence. 

Expression Statement 

Most statements are expression statements, which have the 
form 

expression ; 

Usually expression statements are assignments or function 
calls. 



Compound Statement or Block 

So that several statements can be used where one is expected, 
the compound statement (also, and equivalently, called "block") 
is provided: 



compound-statement:

{ declaration-list_opt statement-list_opt }

declaration-list: 
declaration 
declaration declaration-list 

statement-list: 
statement 
statement statement-list 

If any of the identifiers in the declaration-list were previously 
declared, the outer declaration is pushed down for the duration 
of the block, after which it resumes its force.

Any initializations of auto or register variables are
performed each time the block is entered at the top. It is 
currently possible (but a bad practice) to transfer into a block; 
in that case the initializations are not performed. 
Initializations of static variables are performed only once 
when the program begins execution. Inside a block, extern 
declarations do not reserve storage so initialization is not 
permitted. 
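
A short sketch of these block rules, under the assumption of a present-day compiler. The function names shadow_demo and count_entries are hypothetical. The first shows an inner declaration pushing down the outer one; the second contrasts an auto variable, initialized on every entry, with a static variable, initialized once.

```c
/* Shows an inner x hiding the outer x for the duration of the block. */
int shadow_demo(void)
{
    int x = 1;
    int seen_inside;
    {
        int x = 2;          /* hides the outer x inside this block */
        seen_inside = x;    /* refers to the inner x: 2 */
    }
    /* the outer declaration resumes its force here, so x is 1 again */
    return seen_inside * 10 + x;
}

/* An auto variable is initialized on every entry; a static one only once. */
int count_entries(void)
{
    int fresh = 0;          /* reset to 0 each time the block is entered */
    static int kept = 0;    /* initialized once, keeps its value between calls */

    fresh++;
    kept++;
    return fresh * 100 + kept;
}
```

Successive calls to count_entries return 101, 102, 103, and so on: the hundreds digit never grows because fresh starts again from 0 on each call.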



Conditional Statement 

The two forms of the conditional statement are 

if ( expression ) statement 

if ( expression ) statement else statement 

In both cases, the expression is evaluated; and if it is nonzero, 
the first substatement is executed. In the second case, the 
second substatement is executed if the expression is 0. The 
"else" ambiguity is resolved by connecting an else with the last 
encountered else-less if. 
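
The "else" rule can be seen in a small sketch. The function name which_branch is hypothetical. The indentation below reflects how the compiler pairs the else; some compilers may warn about the missing braces, which is exactly the ambiguity being illustrated.

```c
/* Returns 1, 2, or 0 to show which branch ran. */
int which_branch(int a, int b)
{
    int result = 0;

    if (a)
        if (b)
            result = 1;
        else            /* pairs with "if (b)", the last else-less if */
            result = 2;
    return result;
}
```

When a is 0 neither assignment runs, because the else belongs to the inner if and not to "if (a)".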



While Statement 

The while statement has the form 

while ( expression ) statement 

The substatement is executed repeatedly so long as the value of 
the expression remains nonzero. The test takes place before 
each execution of the statement.

Do Statement

The do statement has the form 

do statement while ( expression ) ; 

The substatement is executed repeatedly until the value of the 
expression becomes 0. The test takes place after each execution 
of the statement. 

For Statement 

The for statement has the form: 



for ( exp-1_opt ; exp-2_opt ; exp-3_opt ) statement



Except for the behavior of continue, this statement is 
equivalent to 

exp-1 ; 

while ( exp-2 ) 

{ 

statement 

exp-3 ; 

} 

Thus the first expression specifies initialization for the loop; 
the second specifies a test, made before each iteration, such 
that the loop is exited when the expression becomes 0. The 
third expression often specifies an incrementing that is 
performed after each iteration. 

Any or all of the expressions may be dropped. A missing exp-2 
makes the implied while clause equivalent to while(1); other
missing expressions are simply dropped from the expansion 
above. 
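
As an illustration of the equivalence, the sketch below writes the same loop three ways. The function names are hypothetical. sum_while is the expansion given above applied by hand, and sum_forever drops all three expressions, so that the missing exp-2 behaves like while(1) and the loop must be left with break.

```c
/* Sum 1..n with a for statement. */
int sum_for(int n)
{
    int i, total = 0;

    for (i = 1; i <= n; i++)
        total += i;
    return total;
}

/* The same loop written as the equivalent while statement. */
int sum_while(int n)
{
    int i, total = 0;

    i = 1;                  /* exp-1 */
    while (i <= n) {        /* exp-2 */
        total += i;         /* statement */
        i++;                /* exp-3 */
    }
    return total;
}

/* With exp-2 missing the test is always true; break ends the loop. */
int sum_forever(int n)
{
    int i = 1, total = 0;

    for (;;) {
        if (i > n)
            break;
        total += i++;
    }
    return total;
}
```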

Switch Statement

The switch statement causes control to be transferred to one 
of several statements depending on the value of an expression. 
It has the form 

switch ( expression ) statement 

The usual arithmetic conversion is performed on the expression, 
but the result must be int. The statement is typically 
compound. Any statement within the statement may be labeled 
with one or more case prefixes as follows: 

case constant-expression : 

where the constant expression must be int. No two of the case 
constants in the same switch may have the same value. 
Constant expressions are precisely defined in "CONSTANT 
EXPRESSIONS." 

There may also be at most one statement prefix of the form 

default : 

When the switch statement is executed, its expression is 
evaluated and compared with each case constant. If one of the 
case constants is equal to the value of the expression, control is 
passed to the statement following the matched case prefix. If 
no case constant matches the expression and if there is a 
default prefix, control passes to the prefixed statement. If no
case matches and if there is no default, then none of the 
statements in the switch is executed. 

The prefixes case and default do not alter the flow of control, 
which continues unimpeded across such prefixes. To exit from 
a switch, see "Break Statement."

Usually, the statement that is the subject of a switch is
compound. Declarations may appear at the head of this 
statement, but initializations of automatic or register variables 
are ineffective. 
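
The sketch below shows that case and default prefixes are only entry points. The function name statements_run is hypothetical. Control entering at case 1 continues across the case 2 prefix until a break is reached; some compilers may warn about the deliberate fall-through.

```c
/* Counts how many statements run when control enters at a given case. */
int statements_run(int n)
{
    int count = 0;

    switch (n) {
    case 1:
        count++;        /* entry point for n == 1 */
    case 2:
        count++;        /* reached from case 1 as well: no implicit exit */
        break;          /* leaves the switch */
    case 3:
        count += 10;
        break;
    default:
        count = -1;     /* no case constant matched */
    }
    return count;
}
```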



Break Statement 

The statement 

break ; 

causes termination of the smallest enclosing while, do, for, or 
switch statement; control passes to the statement following 
the terminated statement. 

Continue Statement 

The statement 

continue ; 

causes control to pass to the loop-continuation portion of the 
smallest enclosing while, do, or for statement; that is to the 
end of the loop. More precisely, in each of the statements 

while (...) do for (...) 

{ { { 

contin: ; contin: ; contin: ; 

} } while (...); } 

a continue is equivalent to goto contin. (Following the 
contin: is a null statement, see "Null Statement".) 
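
A small sketch of this equivalence for the for statement. The function names sum_odd and sum_odd_goto are hypothetical. Both functions skip the even numbers; the second spells continue as a goto to a label carried by a null statement at the end of the loop body.

```c
/* Sum the odd numbers in 1..n; continue skips the rest of the body. */
int sum_odd(int n)
{
    int i, total = 0;

    for (i = 1; i <= n; i++) {
        if (i % 2 == 0)
            continue;   /* jumps to the i++ step, then the test */
        total += i;
    }
    return total;
}

/* The same loop with continue spelled as a goto to the end of the body. */
int sum_odd_goto(int n)
{
    int i, total = 0;

    for (i = 1; i <= n; i++) {
        if (i % 2 == 0)
            goto contin;
        total += i;
    contin: ;           /* null statement carrying the label */
    }
    return total;
}
```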

Return Statement



A function returns to its caller by means of the return 
statement which has one of the forms 



return ; 

return expression ; 

In the first case, the returned value is undefined. In the second 
case, the value of the expression is returned to the caller of the 
function. If required, the expression is converted, as if by 
assignment, to the type of function in which it appears. 
Flowing off the end of a function is equivalent to a return with 
no returned value. The expression may be parenthesized. 



Goto Statement 

Control may be transferred unconditionally by means of the 
statement 



goto identifier ; 

The identifier must be a label (see "Labeled Statement") 
located in the current function. 



Labeled Statement 

Any statement may be preceded by label prefixes of the form 

identifier : 

which serve to declare the identifier as a label. The only use of 
a label is as a target of a goto. The scope of a label is the 
current function, excluding any subblocks in which the same 
identifier has been redeclared. See "SCOPE RULES."

Null Statement

The null statement has the form

;

A null statement is useful to carry a label just before the } of a 
compound statement or to supply a null body to a looping 
statement such as while. 



EXTERNAL DEFINITIONS 

A C program consists of a sequence of external definitions. An 
external definition declares an identifier to have storage class 
extern (by default) or perhaps static, and a specified type. 
The type-specifier (see "Type Specifiers" in 
"DECLARATIONS") may also be empty, in which case the type 
is taken to be int. The scope of external definitions persists to 
the end of the file in which they are declared just as the effect 
of declarations persists to the end of a block. The syntax of 
external definitions is the same as that of all declarations 
except that only at this level may the code for functions be 
given. 



External Function Definitions 

Function definitions have the form 



function-definition: 

decl-specifiers function-declarator function-body 

The only sc-specifiers allowed among the decl-specifiers are 
extern or static; see "Scope of Externals" in "SCOPE 
RULES" for the distinction between them. A function 
declarator is similar to a declarator for a "function returning 
..." except that it lists the formal parameters of the function 
being defined.

function-declarator:

declarator ( parameter-list ) 

parameter-list: 
identifier 
identifier , parameter-list 



The function-body has the form 



function-body: 

declaration-list_opt compound-statement

The identifiers in the parameter list, and only those identifiers, 
may be declared in the declaration list. Any identifiers whose 
type is not given are taken to be int. The only storage class 
which may be specified is register; if it is specified, the 
corresponding actual parameter will be copied, if possible, into 
a register at the outset of the function. 



A simple example of a complete function definition is 

int max(a, b, c)
int a, b, c;
{
     int m;

     m = (a > b) ? a : b;
     return((m > c) ? m : c);
}



Here int is the type-specifier; max(a, b, c) is the
function-declarator; int a, b, c; is the declaration-list for the formal
parameters; { ... } is the block giving the code for the 
statement.

The C program converts all float actual parameters to double,
so formal parameters declared float have their declaration 
adjusted to read double. All char and short formal 
parameter declarations are similarly adjusted to read int. 
Also, since a reference to an array in any context (in particular 
as an actual parameter) is taken to mean a pointer to the first 
element of the array, declarations of formal parameters 
declared "array of ... " are adjusted to read "pointer to . . .". 
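
The array adjustment can be observed directly. In the sketch below the function names clear_first and first_after_clear are hypothetical, and the code is written with present-day prototypes, under which the float, char, and short adjustments described above do not apply in the same way. The parameter is declared as an array, yet an assignment through it changes the caller's array, which is only possible because the parameter is really a pointer.

```c
/* The parameter is declared "array of int" but is adjusted to "pointer to int". */
void clear_first(int a[3])
{
    a[0] = 0;           /* writes through the pointer into the caller's array */
}

int first_after_clear(void)
{
    int values[3] = { 7, 8, 9 };

    clear_first(values);    /* values is passed as a pointer to values[0] */
    return values[0];
}
```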



External Data Definitions 

An external data definition has the form 

data-definition: 
declaration 

The storage class of such data may be extern (which is the 
default) or static but not auto or register. 



SCOPE RULES 

A C program need not all be compiled at the same time. The 
source text of the program may be kept in several files, and 
precompiled routines may be loaded from libraries. 
Communication among the functions of a program may be 
carried out both through explicit calls and through 
manipulation of external data. 

Therefore, there are two kinds of scopes to consider: first, what 
may be called the lexical scope of an identifier, which is 
essentially the region of a program during which it may be 
used without drawing "undefined identifier" diagnostics; and
second, the scope associated with external identifiers, which is 
characterized by the rule that references to the same external 
identifier are references to the same object.

Lexical Scope

The lexical scope of identifiers declared in external definitions 
persists from the definition through the end of the source file 
in which they appear. The lexical scope of identifiers which are 
formal parameters persists through the function with which 
they are associated. The lexical scope of identifiers declared at 
the head of a block persists until the end of the block. The
lexical scope of labels is the whole of the function in which they 
appear. 

In all cases, however, if an identifier is explicitly declared at 
the head of a block, including the block constituting a function, 
any declaration of that identifier outside the block is suspended 
until the end of the block. 

Remember also (see "Structure, Union, and Enumeration 
Declarations" in "DECLARATIONS") that tags, identifiers 
associated with ordinary variables, and identifiers associated
with structure and union members form three disjoint classes 
which do not conflict. Members and tags follow the same scope 
rules as other identifiers. The enum constants are in the same 
class as ordinary variables and follow the same scope rules. 
The typedef names are in the same class as ordinary 
identifiers. They may be redeclared in inner blocks, but an 
explicit type must be given in the inner declaration: 

typedef float distance;
...
{
     auto int distance;
     ...



The int must be present in the second declaration, or it would 
be taken to be a declaration with no declarators and type 
distance. 






Scope of Externals 

If a function refers to an identifier declared to be extern, then 
somewhere among the files or libraries constituting the 
complete program there must be at least one external definition 
for the identifier. All functions in a given program which refer 
to the same external identifier refer to the same object, so care 
must be taken that the type and size specified in the definition 
are compatible with those specified by each function which 
references the data. 



It is illegal to explicitly initialize any external identifier more 
than once in the set of files and libraries comprising a multi- 
file program. It is legal to have more than one data definition 
for any external non-function identifier; explicit use of extern 
does not change the meaning of an external declaration. 

In restricted environments, the use of the extern storage class 
takes on an additional meaning. In these environments, the 
explicit appearance of the extern keyword in external data 
declarations of identifiers without initialization indicates that
the storage for the identifiers is allocated elsewhere, either in 
this file or another file. It is required that there be exactly one 
definition of each external identifier (without extern) in the 
set of files and libraries comprising a multi-file program. 

Identifiers declared static at the top level in external 
definitions are not visible in other files. Functions may be 
declared static. 



COMPILER CONTROL LINES 

The C compiler contains a preprocessor capable of macro 
substitution, conditional compilation, and inclusion of named 
files. Lines beginning with # communicate with this 
preprocessor. There may be any number of blanks and 
horizontal tabs between the # and the directive. These lines 
have syntax independent of the rest of the language; they may
appear anywhere and have effect which lasts (independent of
scope) until the end of the source program file. 



Token Replacement 

A compiler-control line of the form 

#define identifier token-string 

causes the preprocessor to replace subsequent instances of the 
identifier with the given string of tokens. Semicolons in or at 
the end of the token-string are part of that string. A line of 
the form 



#define identifier(identifier, ... ) token-string

where there is no space between the first identifier and the (, is 
a macro definition with arguments. There may be zero or more 
formal parameters. Subsequent instances of the first identifier 
followed by a (, a sequence of tokens delimited by commas, and 
a ) are replaced by the token string in the definition. Each 
occurrence of an identifier mentioned in the formal parameter 
list of the definition is replaced by the corresponding token 
string from the call. The actual arguments in the call are token 
strings separated by commas; however, commas in quoted 
strings or protected by parentheses do not separate arguments. 
The number of formal and actual parameters must be the same. 
Strings and character constants in the token-string are scanned 
for formal parameters, but strings and character constants in 
the rest of the program are not scanned for defined identifiers
to replace.

In both forms the replacement string is rescanned for more 
defined identifiers. In both forms a long definition may be 
continued on another line by writing \ at the end of the line to 
be continued.

This facility is most valuable for defining constants in order to
improve the code's readability. For example: 

#define TABSIZE 100 
int table[TABSIZE]; 

A control line of the form 

#undef identifier 

causes the identifier's preprocessor definition (if any) to be 
forgotten. 

If a #defined identifier is the subject of a subsequent #define
with no intervening #undef, then the two token-strings are 
compared textually. If the two token-strings are not identical 
(all white space is considered as equivalent), then the identifier 
is considered to be redefined. 

Note that #define and #undef declarations do not nest. The 
value of an identifier is solely determined by the most recent 
#define or #undef. 
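
The sketch below gathers these rules in one place. TABSIZE and table come from the example above; the macros MAX and BIGGEST and the three function names are hypothetical. BIGGEST relies on the replacement string being rescanned, and the #undef lets TABSIZE be given a different token-string afterward.

```c
#define TABSIZE 100
#define MAX(a, b) ((a) > (b) ? (a) : (b))   /* no space between MAX and ( */
#define BIGGEST(a, b, c) MAX(MAX(a, b), c)  /* replacement text is rescanned */

int table[TABSIZE];

int table_length(void)
{
    return (int)(sizeof(table) / sizeof(table[0]));
}

int biggest(int a, int b, int c)
{
    return BIGGEST(a, b, c);
}

#undef TABSIZE                  /* the definition is now forgotten ... */
#define TABSIZE 10              /* ... so a different token-string is accepted */

int new_tabsize(void)
{
    return TABSIZE;             /* table itself still has 100 elements */
}
```

The parentheses around each parameter in MAX are a common precaution: the actual arguments are substituted as token strings, so an argument such as x + 1 would otherwise mix with the surrounding operators.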

File Inclusion 

A compiler control line of the form 

#include "filename" 

causes the replacement of that line by the entire contents of the 
file filename. The named file is searched for first in the 
directory of the file containing the #include, and then in a
sequence of specified or standard places. Alternatively, a 
control line of the form

#include <filename>

searches only the specified or standard places and not the 
directory of the #include. (How the places are specified is not
part of the language.) 

#includes may be nested. 



Conditional Compilation 

A compiler control line of the form 

#if restricted-constant-expression 

checks whether the restricted-constant expression evaluates to 
nonzero. (Constant expressions are discussed in "CONSTANT 
EXPRESSIONS"; the following additional restrictions apply 
here: the constant expression may not contain sizeof, casts, or
an enumeration constant.) 

A restricted constant expression may also contain the 
additional unary expression 

defined identifier 

or 

defined( identifier )

which evaluates to one if the identifier is currently defined in 
the preprocessor and zero if it is not. 

All currently defined identifiers in restricted-constant- 
expressions are replaced by their token-strings (except those 
identifiers modified by defined) just as in normal text. The 
restricted constant expression will be evaluated only after all 
replacements have finished. During this evaluation, all identifiers
undefined (to the preprocessor) evaluate to zero.

A control line of the form

#ifdef identifier 

checks whether the identifier is currently defined in the 
preprocessor; i.e., whether it has been the subject of a #define 
control line. It is equivalent to #if defined(identifier). A control
line of the form 

#ifndef identifier 

checks whether the identifier is currently undefined in the 
preprocessor. It is equivalent to #if !defined(identifier).

All three forms are followed by an arbitrary number of lines, 
possibly containing a control line 

#else 

and then by a control line 

#endif 

If the checked condition is true, then any lines between #else 
and the matching #endif are ignored. If the checked condition 
is false, then any lines between the test and the matching 
#else or, lacking a #else, the matching #endif are ignored. 

These constructions may be nested. 
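
A sketch of the three tests side by side. Every identifier here (DEBUG_LEVEL, VERBOSE, NO_SUCH_NAME, and the function names) is hypothetical. Each function is compiled in one of two versions, so its return value records which branch the preprocessor kept.

```c
#define DEBUG_LEVEL 2       /* hypothetical setting */

#if defined(DEBUG_LEVEL) && DEBUG_LEVEL > 1
#define VERBOSE 1
#else
#define VERBOSE 0
#endif

#ifdef VERBOSE              /* same test as: #if defined(VERBOSE) */
int verbose_known(void) { return 1; }
#else
int verbose_known(void) { return 0; }
#endif

#ifndef NO_SUCH_NAME        /* same test as: #if !defined(NO_SUCH_NAME) */
int missing_name_absent(void) { return 1; }
#else
int missing_name_absent(void) { return 0; }
#endif

#if NO_SUCH_NAME            /* an undefined identifier evaluates to zero */
int undefined_is_zero(void) { return 0; }
#else
int undefined_is_zero(void) { return 1; }
#endif

int verbose(void) { return VERBOSE; }
```

Note that VERBOSE is defined in both branches of the first test, so #ifdef VERBOSE succeeds even when its value is 0; #ifdef asks only whether a definition exists.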

Line Control

For the benefit of other preprocessors which generate C 
programs, a line of the form 

#line constant "filename"

causes the compiler to believe, for purposes of error diagnostics, 
that the line number of the next source line is given by the 
constant and the current input file is named by "filename" . If 
"filename" is absent, the remembered file name does not 
change. 
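
The effect of #line can be checked with the predefined macros __LINE__ and __FILE__. These macros are not described in this chapter; they are assumed here to be available, as they are in present-day compilers. The file name generated.src and the function names are hypothetical.

```c
/* The line that follows the control line is treated as line 100. */
#line 100 "generated.src"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }
```

A program generator would emit such a line before each fragment copied from its own input, so that diagnostics point back to the file the programmer actually wrote.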



IMPLICIT DECLARATIONS 

It is not always necessary to specify both the storage class and 
the type of identifiers in a declaration. The storage class is 
supplied by the context in external definitions and in 
declarations of formal parameters and structure members. In a 
declaration inside a function, if a storage class but no type is 
given, the identifier is assumed to be int; if a type but no 
storage class is indicated, the identifier is assumed to be auto. 
An exception to the latter rule is made for functions because 
auto functions do not exist. If the type of an identifier is 
"function returning . . . ," it is implicitly declared to be extern. 

In an expression, an identifier followed by ( and not already 
declared is contextually declared to be "function returning int." 



TYPES REVISITED 

This part summarizes the operations which can be performed 
on objects of certain types.

Structures and Unions

Structures and unions may be assigned, passed as arguments to 
functions, and returned by functions. Other plausible 
operators, such as equality comparison and structure casts, are 
not implemented. 

In a reference to a structure or union member, the name on the 
right of the -> or the . must specify a member of the
aggregate named or pointed to by the expression on the left. In 
general, a member of a union may not be inspected unless the 
value of the union has been assigned using that same member. 
However, one special guarantee is made by the language in 
order to simplify the use of unions: if a union contains several 
structures that share a common initial sequence and if the 
union currently contains one of these structures, it is permitted 
to inspect the common initial part of any of the contained 
structures. For example, the following is a legal fragment: 






union
{
     struct
     {
          int  type;
     } n;
     struct
     {
          int  type;
          int  intnode;
     } ni;
     struct
     {
          int    type;
          float  floatnode;
     } nf;
} u;
...
u.nf.type = FLOAT;
u.nf.floatnode = 3.14;
...
if (u.n.type == FLOAT)
     ... sin(u.nf.floatnode) ...
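
A complete version of this fragment might look like the sketch below. The tag values INT and FLOAT, the union tag node, and the function names are assumptions added so that the fragment compiles on its own. node_value inspects the common initial member through n, whichever structure was stored.

```c
#define INT   1     /* hypothetical tag values */
#define FLOAT 2

union node
{
    struct { int type; } n;
    struct { int type; int intnode; } ni;
    struct { int type; float floatnode; } nf;
};

/* Reads the shared "type" member through n, whichever structure was stored. */
double node_value(const union node *u)
{
    if (u->n.type == FLOAT)
        return u->nf.floatnode;
    return u->ni.intnode;
}

double float_example(void)
{
    union node u;

    u.nf.type = FLOAT;
    u.nf.floatnode = 2.5f;
    return node_value(&u);
}

double int_example(void)
{
    union node u;

    u.ni.type = INT;
    u.ni.intnode = 7;
    return node_value(&u);
}
```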



Functions 

There are only two things that can be done with a function: call 
it or take its address. If the name of a function appears in an 
expression not in the function-name position of a call, a pointer 
to the function is generated. Thus, to pass one function to 
another, one might say 

int f(); 

g(f);

Then the definition of g might read

g(funcp) 

int (*funcp)(); 

{ 

(*funcp)(); 

} 

Notice that f must be declared explicitly in the calling routine 
since its appearance in g(f) was not followed by (. 
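
Put together as a unit that compiles today, the example might read as follows. The body of f, the return values, and the name pass_function are assumptions added for illustration, and prototypes replace the older declaration style.

```c
int f(void)
{
    return 42;
}

/* g receives a pointer to a function returning int and calls through it. */
int g(int (*funcp)(void))
{
    return (*funcp)();
}

int pass_function(void)
{
    return g(f);    /* f without ( yields a pointer to the function */
}
```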

Arrays, Pointers, and Subscripting 

Every time an identifier of array type appears in an expression, 
it is converted into a pointer to the first member of the array. 
Because of this conversion, arrays are not lvalues. By 
definition, the subscript operator [] is interpreted in such a way 
that E1[E2] is identical to *((E1)+(E2)). Because of the 
conversion rules which apply to +, if E1 is an array and E2 an
integer, then E1[E2] refers to the E2-th member of E1.
Therefore, despite its asymmetric appearance, subscripting is a 
commutative operation. 

A consistent rule is followed in the case of multidimensional 
arrays. If E is an n-dimensional array of rank i×j×...×k, then
E appearing in an expression is converted to a pointer to an
(n-1)-dimensional array with rank j×...×k. If the * operator,
either explicitly or implicitly as a result of subscripting, is
applied to this pointer, the result is the pointed-to
(n-1)-dimensional array, which itself is immediately converted into a
pointer.

For example, consider

int x[3][5];

Here x is a 3×5 array of integers. When x appears in an
expression, it is converted to a pointer to (the first of three)
5-membered arrays of integers. In the expression x[i], which is
equivalent to *(x+i), x is first converted to a pointer as
described; then i is converted to the type of x, which involves
multiplying i by the length of the object to which the pointer
points, namely 5-integer objects. The results are added and
indirection applied to yield an array (of five integers) which in 
turn is converted to a pointer to the first of the integers. If 
there is another subscript, the same argument applies again; 
this time the result is an integer. 

Arrays in C are stored row-wise (last subscript varies fastest),
and the first subscript in the declaration helps determine the
amount of storage consumed by an array. The first subscript plays
no other part in subscript calculations.
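
These rules can be checked with a short sketch built on the array x declared above. The function names are hypothetical. The first function compares the ordinary subscript form against the pointer form and the swapped-operand forms; the second measures the distance between the starts of two consecutive rows.

```c
int x[3][5];

/* E1[E2] is *((E1)+(E2)), so the operands may be swapped. */
int subscript_forms_agree(int i, int j)
{
    x[i][j] = 100 + 10 * i + j;
    return x[i][j] == *(*(x + i) + j)   /* definition applied twice */
        && x[i][j] == j[x[i]]           /* commutativity of subscripting */
        && x[i][j] == j[i[x]];
}

/* Row-wise storage: row 1 begins five ints after row 0. */
int row_stride_in_ints(void)
{
    return (int)(((char *)&x[1][0] - (char *)&x[0][0]) / sizeof(int));
}
```

The swapped forms are legal but serve only to demonstrate the definition; ordinary code would write x[i][j].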



Explicit Pointer Conversions 

Certain conversions involving pointers are permitted but have 
implementation-dependent aspects. They are all specified by 
means of an explicit type-conversion operator; see "Unary
Operators" under "EXPRESSIONS" and "Type Names" under
"DECLARATIONS."



A pointer may be converted to any of the integral types large 
enough to hold it. Whether an int or long is required is 
machine dependent and may also depend on the pointer type. 
The mapping function is also machine dependent but is 
intended to be unsurprising to those who know the addressing 
structure of the machine. Details for some particular machines 
are given below. 

An object of integral type may be explicitly converted to a 
pointer. The mapping always carries an integer converted from 
a pointer back to a pointer which points to the same location
but is otherwise machine dependent.

A pointer to one type may be converted to a pointer to another 
type. The resulting pointer may cause addressing exceptions 
upon use if the subject pointer does not refer to an object 
suitably aligned in storage. 

For example, a storage-allocation routine might accept a size 
(in bytes) of an object to allocate, and return a char pointer; it 
might be used in this way. 

extern char *alloc(); 
double *dp; 

dp = (double *) alloc(sizeof(double)); 
*dp = 22.0 / 7.0; 

The alloc must ensure (in a machine-dependent way) that its 
return value is suitable for conversion to a pointer to double; 
then the use of the function is portable. 
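
A runnable variant of this example is sketched below. It assumes the standard library routine malloc as a stand-in for alloc; malloc returns a generic pointer rather than a char pointer, but it likewise guarantees storage aligned for any object. The function names stored_value and round_trip_ok are hypothetical.

```c
#include <stdlib.h>

/* Stores 22.0/7.0 in freshly allocated storage and reads it back. */
double stored_value(void)
{
    double *dp;
    double result;

    dp = (double *) malloc(sizeof(double));  /* malloc stands in for alloc */
    if (dp == NULL)
        return 0.0;     /* allocation failed */
    *dp = 22.0 / 7.0;
    result = *dp;
    free(dp);
    return result;
}

/* A pointer converted to a different pointer type and back is unchanged. */
int round_trip_ok(void)
{
    double d = 1.0;
    char *cp = (char *) &d;         /* char has no alignment requirement */
    double *back = (double *) cp;   /* safe: cp came from a double pointer */

    return back == &d && *back == 1.0;
}
```

The conversion in round_trip_ok is safe only because cp started life as the address of a double; an arbitrary char pointer converted to a double pointer may be misaligned.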

The pointer representation on the PDP-11 corresponds to a 16- 
bit integer and measures bytes. The char's have no alignment 
requirements; everything else must have an even address. 

On the VAX-11, pointers are 32 bits long and measure bytes. 
Elementary objects are aligned on a boundary equal to their 
length, except that double quantities need be aligned only on 
even 4-byte boundaries. Aggregates are aligned on the strictest 
boundary required by any of their constituents. 

The 3B20 has 24-bit pointers placed into 32-bit quantities. 

The UNIX PC has 32-bit pointers. Most objects are aligned on 
4-byte boundaries. Shorts are aligned in all cases on 2-byte 
boundaries. Arrays of characters, all structures, ints, longs,
floats, and doubles are aligned on 4-byte boundaries; but
structure members may be packed tighter. 



CONSTANT EXPRESSIONS 

In several places C requires expressions that evaluate to a 
constant: after case, as array bounds, and in initializers. In 
the first two cases, the expression can involve only integer 
constants, character constants, casts to integral types, 
enumeration constants, and sizeof expressions, possibly 
connected by the binary operators 

+ - * / % & | ^ << >> == != < > <= >= && ||

or by the unary operators

- ~

or by the ternary operator

?:

Parentheses can be used for grouping but not for function calls. 

More latitude is permitted for initializers; besides constant 
expressions as discussed above, one can also use floating 
constants and arbitrary casts and can also apply the unary & 
operator to external or static objects and to external or static 
arrays subscripted with a constant expression. The unary & 
can also be applied implicitly by appearance of unsubscripted 
arrays and functions. The basic rule is that initializers must 
evaluate either to a constant or to the address of a previously 
declared external or static object plus or minus a constant. 
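
The sketch below uses constant expressions in each of the three places named above. All names in it are hypothetical. The array bound combines enumeration constants with sizeof, the case prefixes use a character constant, a cast, a shift, and the ternary operator, and the two pointer initializers evaluate to the address of a static object plus a constant.

```c
enum { RED, GREEN, BLUE };

#define NCOLORS (BLUE + 1)

/* Array bound: integer and enumeration constants, sizeof, binary operators. */
char names[NCOLORS * sizeof(int)];

/* Initializers may also use the address of a static object plus a constant. */
static int slots[10];
static int *middle = &slots[5];     /* & applied to a subscripted static array */
static int *start  = slots;         /* & applied implicitly to an array name */

int classify(int c)
{
    switch (c) {
    case 'a' + 1:                   /* character constant in a case expression */
        return 1;
    case (int) sizeof(char) << 4:   /* cast, sizeof, and a shift: 16 */
        return 2;
    case RED == 0 ? 100 : 200:      /* ternary operator: 100 */
        return 3;
    default:
        return 0;
    }
}

int middle_offset(void)
{
    return (int)(middle - start);
}
```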

PORTABILITY CONSIDERATIONS

Certain parts of C are inherently machine dependent. The 
following list of potential trouble spots is not meant to be all- 
inclusive but to point out the main ones. 



Purely hardware issues like word size and the properties of 
floating point arithmetic and integer division have proven in 
practice to be not much of a problem. Other facets of the 
hardware are reflected in differing implementations. Some of 
these, particularly sign extension (converting a negative 
character into a negative integer) and the order in which bytes 
are placed in a word, are nuisances that must be carefully 
watched. Note that unsigned chars do not have this problem.

The number of register variables that can actually be placed 
in registers varies from machine to machine as does the set of 
valid types. Nonetheless, the compilers all do things properly 
for their own machine; excess or invalid register declarations 
are ignored. 

Dubious coding practices, such as neglecting type conversions
when passing arguments to functions, can cause trouble. Lint 
can be used to detect problems of this type. 

The order of evaluation of function arguments is not specified 
by the language. The order in which side effects take place is 
also unspecified. 

Since character constants are really objects of type int, 
multicharacter character constants may be permitted. The 
specific implementation is very machine dependent because the 
order in which characters are assigned to a word varies from 
one machine to another. 

Fields are assigned to words and characters to integers right to 
left on some machines and left to right on other machines. 
These differences are invisible to isolated programs that do not
indulge in type punning (e.g., by converting an int pointer to a
char pointer and inspecting the pointed-to storage) but must 
be accounted for when conforming to externally-imposed 
storage layouts. 
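
Two of these trouble spots, byte order and sign extension, can be probed with the sketch below. The function names are hypothetical, and the comments assume 8-bit chars. The results of low_byte_first and widened_plain legitimately differ from one machine to another, which is the point of the exercise.

```c
/* Returns 1 if the low-order byte of an int is stored first, else 0. */
int low_byte_first(void)
{
    unsigned int word = 1;
    unsigned char *p = (unsigned char *) &word;   /* type punning via char */

    return p[0] == 1;
}

/* Widening a char holding 0xFF: the result depends on sign extension. */
int widened_plain(void)
{
    char c = (char) 0xFF;
    return c;               /* -1 where char is signed, 255 where it is not */
}

/* unsigned char never sign-extends, so the result is the same everywhere. */
int widened_unsigned(void)
{
    unsigned char c = 0xFF;
    return c;               /* 255, assuming 8-bit chars */
}
```

Code that must match an externally imposed layout usually assembles values byte by byte with shifts rather than relying on the order reported by a probe like low_byte_first.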



SYNTAX SUMMARY 

This summary of C syntax is intended more for aiding 
comprehension than as an exact statement of the language. 



Expressions 

The basic expressions are: 



expression: 

primary 

* expression 

&lvalue 

- expression 

! expression

~ expression

++ lvalue

-- lvalue

lvalue ++

lvalue --

sizeof expression 

sizeof (type-name) 

( type-name ) expression 

expression binop expression

expression ? expression : expression 

lvalue asgnop expression 

expression , expression

primary:

identifier 

constant 

string 

( expression ) 

primary ( expression-list ) 

primary [ expression ] 

primary . identifier 

primary -> identifier 

lvalue: 

identifier 

primary [ expression ] 

lvalue . identifier 

primary -> identifier 

* expression 

( lvalue ) 

The primary-expression operators 

() [] . ->

have highest priority and group left to right. The unary 
operators 

* & - ! ~ ++ -- sizeof ( type-name )



have priority below the primary operators but higher than any
binary operator and group right to left. Binary operators group 
left to right; they have priority decreasing as indicated below.

binop:

* / %

+ -

>> <<

< > <= >=

== !=

&

^

|

&&

||

The conditional operator groups right to left. 

Assignment operators all have the same priority and all group 
right to left. 



asgnop: 

= += -= *= /= %= >>= <<= &= ^= |=



The comma operator has the lowest priority and groups left to 
right. 



Declarations 



declaration:

decl-specifiers init-declarator-list_opt ;

decl-specifiers:

type-specifier decl-specifiers_opt
sc-specifier decl-specifiers_opt

sc-specifier:
auto 
static 
extern 
register 
typedef 

type-specifier: 

struct-or-union-specifier 

typedef-name 

enum-specifier 
basic-type-specifier: 

basic-type 

basic-type basic-type-specifiers 
basic-type: 

char 

short 

int 

long 

unsigned 

float 

double 

void 

enum-specifier: 

enum { enum-list } 

enum identifier { enum-list } 

enum identifier 

enum-list: 

enumerator 
enum-list , enumerator 

enumerator: 
identifier 
identifier = constant-expression

init-declarator-list:
init-declarator
init-declarator , init-declarator-list

init-declarator:

declarator initializer_opt

declarator: 

identifier 

( declarator ) 

* declarator 

declarator () 

declarator [ constant-expression_opt ]

struct-or-union-specifier: 

struct { struct-decl-list } 

struct identifier { struct-decl-list } 

struct identifier 

union { struct-decl-list } 

union identifier { struct-decl-list } 

union identifier 

struct-decl-list: 

struct-declaration 
struct-declaration struct-decl-list 

struct-declaration: 

type-specifier struct-declarator-list ; 

struct-declarator-list: 
struct-declarator 
struct-declarator , struct-declarator-list

struct-declarator:
declarator 

declarator : constant-expression 
: constant-expression 

initializer: 

= expression 

= { initializer-list } 

= { initializer-list , } 

initializer-list:
expression 

initializer-list , initializer-list 
{ initializer-list } 
{ initializer-list , } 

type-name: 

type-specifier abstract-declarator 

abstract-declarator: 
empty 

( abstract-declarator ) 
* abstract-declarator 
abstract-declarator () 
abstract-declarator [ constant-expression ] 

typedef-name: 
identifier 
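
As an illustration, the following sketch shows declarations that exercise several of the productions above. Every name in it (color, flags, flags_t, table, twice, handler, table_len) is hypothetical. The function declarations use ANSI prototypes, which this grammar predates, so that the sketch compiles with current compilers.

```c
/* Illustrative declarations; every name below is hypothetical. */

enum color { RED, GREEN = 5, BLUE };    /* enumerators with and without a value */

struct flags {
    unsigned ready : 1;                 /* declarator : constant-expression */
    unsigned       : 2;                 /* : constant-expression (unnamed field) */
    unsigned mode  : 3;
};

typedef struct flags flags_t;           /* sc-specifier typedef */

static int table[] = { 1, 2, 3, };      /* = { initializer-list , } */

static int twice(int n)
{
    return 2 * n;
}

int (*handler)(int) = twice;            /* ( declarator ) combined with * declarator */

/* Number of elements in table, taken from the initializer list. */
int table_len(void)
{
    return (int)(sizeof table / sizeof table[0]);
}
```

Since GREEN is given the value 5, the enumerator that follows it, BLUE, takes the next value, 6.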



Statements 



compound-statement: 

{ declaration-list opt statement-list opt } 

declaration-list: 
declaration 
declaration declaration-list 

statement-list: 
statement 
statement statement-list 

statement: 

compound-statement 

expression ; 

if ( expression ) statement 

if ( expression ) statement else statement 

while ( expression ) statement 

do statement while ( expression ) ; 

for ( exp opt ; exp opt ; exp opt ) statement 

switch ( expression ) statement 

case constant-expression : statement 

default : statement 

break ; 

continue ; 

return ; 

return expression ; 

goto identifier ; 

identifier : statement 



External definitions 



program: 

external-definition 
external-definition program 

external-definition: 

function-definition 
data-definition 

function-definition: 

decl-specifiers function-declarator function-body 

function-declarator: 

declarator ( parameter-list ) 

parameter-list: 
identifier 
identifier , parameter-list 



function-body: 

declaration-list opt compound-statement 

data-definition: 

extern declaration ; 
static declaration ; 



Preprocessor 



#define identifier token-string opt 

#define identifier(identifier,...)token-string 

#undef identifier 

#include "filename" 

#include <filename> 

#if restricted-constant-expression 

#ifdef identifier 

#ifndef identifier 

#else 

#endif 

#line constant "filename" 



Chapter 3 
C LIBRARIES 

GENERAL 

This chapter and Chapter 4 describe the libraries that are 
supported on the UNIX operating system. A library is a 
collection of related functions and/or declarations that simplify 
programming effort by linking only what is needed, allowing 
use of locally produced functions, etc. All of the functions 
described are also described in Section 3 of the AT&T UNIX 
PC UNIX System V Manual. Most of the declarations 
described are in Section 5 of the AT&T UNIX PC UNIX 
System V Manual. The main libraries on the UNIX system are: 

C library 

    This is the basic library for C language programs. The C 
    library is composed of functions and declarations used for 
    file access, string testing and manipulation, character 
    testing and manipulation, memory allocation, and other 
    functions. This library is described later in this chapter. 

Object file library 

    This library provides functions for the access and 
    manipulation of object files. This library is described in 
    Chapter 4. 

Math library 

    This library provides exponential, Bessel, logarithmic, 
    hyperbolic, and trigonometric functions. This library is 
    described in Chapter 4. 

tam library 

    This library contains the AT&T UNIX PC "terminal access 
    method" (tam) functions. 

Some libraries consist of two portions— functions and 
declarations. In some cases, the user must request that the 
functions (and/or declarations) of a specific library be included 
in a program being compiled. In other cases, the functions 
(and/or declarations) are included automatically. 



Including Functions 

When a program is being compiled, the compiler will 
automatically search the C language library to locate and 
include functions that are used in the program. This is the case 
only for the C library and no other library. In order for the 
compiler to locate and include functions from other libraries, 
the user must specify these libraries on the command line for 
the compiler. For example, when using functions of the math 
library, the user must request that the math library be 
searched by including the argument -lm on the command line, 
such as: 

cc file.c -lm 

The argument -lm must come after all files that reference 
functions in the math library in order for the link editor to 
know which functions to include in the a.out file. 

This method should be used for all functions that are not part 
of the C language library. 



Including Declarations 

Some functions require a set of declarations in order to operate 
properly. A set of declarations is stored in a file under the 
/usr/include directory. These files are referred to as header 
files. In order to include a certain header file, the user must 
specify this request within the C language program. The 
request is in the form: 

#include <file.h> 

where file.h is the name of the file. Since the header files 
define the type of the functions and various preprocessor 
constants, they must be included before invoking the functions 
they declare. 

The remainder of this chapter describes the functions and 
header files of the C Library. The description of the library 
begins with the actions required by the user to include the 
functions and/or header files in a program being compiled (if 
any). Following the description of the actions required is 
information in three-column format of the form: 

function reference (N) Brief description. 



The functions are grouped by type while the reference refers to 
section 'N' in the AT&T UNIX PC UNIX System V Manual. 
Following this, are descriptions of the header files associated 
with these functions (if any). 



THE C LIBRARY 

The C library consists of several types of functions. All the 
functions of the C library are loaded automatically by the 
compiler. Various declarations must sometimes be included by 
the user as required. The functions of the C library are divided 
into the following types: 

• Input/output control 

• String manipulation 

• Character manipulation 

• Time functions 

• Miscellaneous functions. 



Input/Output Control 

These functions of the C library are automatically included as 
needed during the compiling of a C language program. No 
command line request is needed. 

The header file required by the input/output functions should 
be included in the program being compiled. This is 
accomplished by including the line: 

#include <stdio.h> 

near the beginning of each file that references an input or 
output function. 

The input/output functions are grouped into the following 
categories: 

• File access 

• File status 

• Input 

• Output 

• Miscellaneous. 



File Access Functions 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

fclose      fclose(3S)      Close an open stream. 
fdopen      fopen(3S)       Associate stream with an open(2)ed file. 
fileno      ferror(3S)      File descriptor associated with an open 
                            stream. 
fopen       fopen(3S)       Open a file with specified permissions. 
                            Fopen returns a pointer to a stream which 
                            is used in subsequent references to the 
                            file. 
freopen     fopen(3S)       Substitute named file in place of open 
                            stream. 
fseek       fseek(3S)       Reposition the file pointer. 
pclose      popen(3S)       Close a stream opened by popen. 
popen       popen(3S)       Create pipe as a stream between calling 
                            process and command. 
rewind      fseek(3S)       Reposition file pointer at beginning of 
                            file. 
setbuf      setbuf(3S)      Assign buffering to stream. 



File Status Functions 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

clearerr    ferror(3S)      Reset error condition on stream. 
feof        ferror(3S)      Test for "end of file" on stream. 
ferror      ferror(3S)      Test for error condition on stream. 
ftell       fseek(3S)       Return current position in the file. 



Input Functions 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

fgetc       getc(3S)        True function for getc(3S). 
fgets       gets(3S)        Read string from stream. 
fread       fread(3S)       General buffered read from stream. 
fscanf      scanf(3S)       Formatted read from stream. 
getc        getc(3S)        Read character from stream. 
getchar     getc(3S)        Read character from standard input. 
gets        gets(3S)        Read string from standard input. 
getw        getc(3S)        Read word from stream. 
scanf       scanf(3S)       Read using format from standard input. 
sscanf      scanf(3S)       Formatted read from string. 
ungetc      ungetc(3S)      Put back one character on stream. 
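
As an illustration of the input and file status functions, the following sketch reads a stream one character at a time with getc and uses ferror to tell an error from a normal end of file. The function name count_lines is hypothetical, and the code is written in ANSI C so that it compiles with current compilers.

```c
#include <stdio.h>

/*
 * Hypothetical helper: count newline characters on an open stream.
 * Returns the count, or -1 if an error was detected on the stream.
 */
long count_lines(FILE *fp)
{
    long lines = 0;
    int c;                  /* int, not char, so that EOF can be detected */

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }
    if (ferror(fp)) {       /* EOF was caused by an error, not end of file */
        clearerr(fp);
        return -1;
    }
    return lines;
}
```

A final line that does not end in a newline is not counted, since the function counts newline characters rather than lines of text.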



Output Functions 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

fflush      fclose(3S)      Write all currently buffered characters 
                            from stream. 
fprintf     printf(3S)      Formatted write to stream. 
fputc       putc(3S)        True function for putc(3S). 
fputs       puts(3S)        Write string to stream. 
fwrite      fread(3S)       General buffered write to stream. 
printf      printf(3S)      Print using format to standard output. 
putc        putc(3S)        Write character to stream. 
putchar     putc(3S)        Write character to standard output. 
puts        puts(3S)        Write string to standard output. 
putw        putc(3S)        Write word to stream. 
sprintf     printf(3S)      Formatted write to string. 



Miscellaneous Functions 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

ctermid     ctermid(3S)     Return file name for controlling terminal. 
cuserid     cuserid(3S)     Return login name for owner of current 
                            process. 
system      system(3S)      Execute shell command. 
tempnam     tempnam(3S)     Create temporary file name using directory 
                            and prefix. 
tmpnam      tmpnam(3S)      Create temporary file name. 
tmpfile     tmpfile(3S)     Create temporary file. 
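
As an illustration, several of the input/output functions listed above can be combined as in the following sketch, which writes a line to a temporary file and reads it back. The function name roundtrip_line is hypothetical, and the code is written in ANSI C so that it compiles with current compilers.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write a line to a temporary file, then read it
 * back. Returns 0 on success and -1 on failure; out receives the line
 * without its trailing newline.
 */
int roundtrip_line(const char *text, char *out, int outsize)
{
    FILE *fp = tmpfile();       /* opened for update, removed on close */
    int ok = 0;

    if (fp == NULL)
        return -1;

    if (fprintf(fp, "%s\n", text) >= 0) {
        rewind(fp);             /* reposition before switching to input */
        if (fgets(out, outsize, fp) != NULL) {
            out[strcspn(out, "\n")] = '\0';
            ok = 1;
        }
    }
    fclose(fp);
    return ok ? 0 : -1;
}
```

The call to rewind matters here: a stream opened for update needs a repositioning call (or fflush) between output and the input that follows it.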



String Manipulation Functions 

These functions are used to locate characters within a string, 
copy, concatenate, and compare strings. These functions are 
automatically located and loaded during the compiling of a C 
language program. No command line request is needed since 
these functions are part of the C library. The string 
manipulation functions are declared in a header file that may 
be included in the program being compiled. This is 
accomplished by including the line: 

#include <string.h> 

near the beginning of each file that uses one of these functions. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

strcat      string(3C)      Concatenate two strings. 
strchr      string(3C)      Search string for character. 
strcmp      string(3C)      Compares two strings. 
strcpy      string(3C)      Copy string. 
strcspn     string(3C)      Length of initial string not containing 
                            set of characters. 
strlen      string(3C)      Length of string. 
strncat     string(3C)      Concatenate two strings with a maximum 
                            length. 
strncmp     string(3C)      Compares two strings with a maximum 
                            length. 
strncpy     string(3C)      Copy string over string with a maximum 
                            length. 
strpbrk     string(3C)      Search string for any set of characters. 
strrchr     string(3C)      Search string backwards for character. 
strspn      string(3C)      Length of initial string containing set of 
                            characters. 
strtok      string(3C)      Search string for token separated by any 
                            of a set of characters. 


Character Manipulation 

The following functions and declarations are used for testing 
and translating ASCII characters. These functions are located 
and loaded automatically during the compiling of a C language 
program. No command line request is needed since these 
functions are part of the C library. 

The declarations associated with these functions should be 
included in the program being compiled. This is accomplished 
by including the line: 



#include <ctype.h> 

near the beginning of the file being compiled. 



Character Testing Functions 

These functions can be used to identify characters as uppercase 
or lowercase letters, digits, punctuation, etc. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

isalnum     ctype(3C)       Is character alphanumeric? 
isalpha     ctype(3C)       Is character alphabetic? 
isascii     ctype(3C)       Is integer ASCII character? 
iscntrl     ctype(3C)       Is character a control character? 
isdigit     ctype(3C)       Is character a digit? 
isgraph     ctype(3C)       Is character a printable character? 
islower     ctype(3C)       Is character a lowercase letter? 
isprint     ctype(3C)       Is character a printing character 
                            including space? 
ispunct     ctype(3C)       Is character a punctuation character? 
isspace     ctype(3C)       Is character a white space character? 
isupper     ctype(3C)       Is character an uppercase letter? 
isxdigit    ctype(3C)       Is character a hex digit? 



Character Translation Functions 

These functions provide translation of uppercase to lowercase, 
lowercase to uppercase, and integer to ASCII. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

toascii     conv(3C)        Convert integer to ASCII character. 
tolower     conv(3C)        Convert character to lowercase. 
toupper     conv(3C)        Convert character to uppercase. 
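
As an illustration, the character testing and translation functions are typically used together, as in the following sketch. The function name upcase is hypothetical, and the code is written in ANSI C so that it compiles with current compilers.

```c
#include <ctype.h>

/*
 * Hypothetical helper: convert the lowercase letters in s to uppercase,
 * in place. Returns the number of characters changed.
 */
int upcase(char *s)
{
    int changed = 0;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;    /* avoid negative values */

        if (islower(c)) {
            *s = (char)toupper(c);
            changed++;
        }
    }
    return changed;
}
```

The cast to unsigned char is a precaution: the ctype functions expect an argument in the range of unsigned char (or EOF), and a plain char may be negative on some machines.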



Time Functions 

These functions are used for accessing and reformatting the 
system's idea of the current date and time. These functions are 
located and loaded automatically during the compiling of a C 
language program. No command line request is needed since 
these functions are part of the C library. 

The header file associated with these functions should be 
included in the program being compiled. This is accomplished 
by including the line: 

#include <time.h> 

near the beginning of any file using the time functions. 

These functions (except tzset) convert a time such as returned 
by time(2). 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

asctime     ctime(3C)       Return string representation of date and 
                            time. 
ctime       ctime(3C)       Return string representation of date and 
                            time, given integer form. 
gmtime      ctime(3C)       Return Greenwich Mean Time. 
localtime   ctime(3C)       Return local time. 
tzset       ctime(3C)       Set time zone field from environment 
                            variable. 
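
As an illustration, the following sketch converts a time value, such as one returned by time(2), to its string form in Greenwich Mean Time using gmtime and asctime. The function name format_utc is hypothetical, and the code is written in ANSI C so that it compiles with current compilers.

```c
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy the asctime form of t, interpreted as
 * Greenwich Mean Time, into buf without the trailing newline.
 * Returns 0 on success and -1 if buf is too small or conversion fails.
 */
int format_utc(time_t t, char *buf, size_t size)
{
    struct tm *tm = gmtime(&t);     /* broken-down time in static storage */
    char *s;

    if (tm == NULL || size < 25)
        return -1;
    s = asctime(tm);                /* "Thu Jan  1 00:00:00 1970\n" form */
    if (s == NULL)
        return -1;
    strncpy(buf, s, 24);            /* the 24 characters before the newline */
    buf[24] = '\0';
    return 0;
}
```

Both gmtime and asctime return pointers to static data that later calls overwrite, which is why the sketch copies the result into a buffer supplied by the caller.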



Miscellaneous Functions 

These functions support a wide variety of operations. Some of 
these are numerical conversion, password file and group file 
access, memory allocation, random number generation, and 
table management. These functions are automatically located 
and included in a program being compiled. No command line 
request is needed since these functions are part of the C 
library. 

Some of these functions require declarations to be included. 
These are described following the descriptions of the functions. 



Numerical Conversion 

The following functions perform numerical conversion. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

a64l        a64l(3C)        Convert base 64 ASCII string to long. 
atof        atof(3C)        Convert string to floating. 
atoi        atof(3C)        Convert string to integer. 
atol        atof(3C)        Convert string to long. 
frexp       frexp(3C)       Split floating into mantissa and exponent. 
l3tol       l3tol(3C)       Convert 3-byte integer to long. 
ltol3       l3tol(3C)       Convert long to 3-byte integer. 
ldexp       frexp(3C)       Combine mantissa and exponent. 
l64a        a64l(3C)        Convert long to base 64 ASCII string. 
modf        frexp(3C)       Split mantissa into integer and fraction. 
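
As an illustration, the following sketch converts a string with atof, splits the value with frexp, and rebuilds it with ldexp. The function name split_and_rebuild is hypothetical. The sketch assumes a current ANSI C system, where frexp and ldexp are declared in <math.h> and atof in <stdlib.h>; the headers on the original system may differ.

```c
#include <math.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert the decimal string s with atof, split
 * the value with frexp, and rebuild it with ldexp. Returns the rebuilt
 * value; *exp2 receives the power-of-two exponent.
 */
double split_and_rebuild(const char *s, int *exp2)
{
    double value = atof(s);             /* string to floating */
    double mant = frexp(value, exp2);   /* value == mant * 2^(*exp2) */

    return ldexp(mant, *exp2);          /* combine mantissa and exponent */
}
```

For a nonzero value, frexp returns a mantissa whose magnitude is at least 0.5 and less than 1, so the string "8.0" splits into the mantissa 0.5 and the exponent 4.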



DES Algorithm Access 

The following functions allow access to the Data Encryption 
Standard (DES) algorithm used on the UNIX operating system. 
The DES algorithm is implemented with variations to frustrate 
use of hardware implementations of the DES for key search. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

crypt       crypt(3C)       Encode string. 
encrypt     crypt(3C)       Encode/decode string of 0s and 1s. 
setkey      crypt(3C)       Initialize for subsequent use of encrypt. 



Group File Access 

The following functions are used to obtain entries from the 
group file. Declarations for these functions must be included in 
the program being compiled with the line: 

#include <grp.h> 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

endgrent    getgrent(3C)    Close group file being processed. 
getgrent    getgrent(3C)    Get next group file entry. 
getgrgid    getgrent(3C)    Return next group with matching gid. 
getgrnam    getgrent(3C)    Return next group with matching name. 
setgrent    getgrent(3C)    Rewind group file being processed. 



Password File Access 

These functions are used to search and access information 
stored in the password file (/etc/passwd). Some functions 
require declarations that can be included in the program being 
compiled by adding the line: 

#include <pwd.h> 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

endpwent    getpwent(3C)    Close password file being processed. 
getpw       getpw(3C)       Search password file for uid. 
getpwent    getpwent(3C)    Get next password file entry. 
getpwnam    getpwent(3C)    Return next entry with matching name. 
getpwuid    getpwent(3C)    Return next entry with matching uid. 
putpwent    putpwent(3C)    Write entry on stream. 
setpwent    getpwent(3C)    Rewind password file being accessed. 



Parameter Access 

The following functions provide access to several different types 
of parameters. None require any declarations. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

getopt      getopt(3C)      Get next option from option list. 
getcwd      getcwd(3C)      Return string representation of current 
                            working directory. 
getenv      getenv(3C)      Return string value associated with 
                            environment variable. 
getpass     getpass(3C)     Read string from terminal without echoing. 
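
As an illustration of getenv, the following sketch looks up an environment variable and falls back to a default when the variable is not set. The function name env_or_default is hypothetical. The sketch includes <stdlib.h>, which declares getenv on current ANSI C systems.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: return the value of the environment variable
 * name, or dflt if the variable is not set.
 */
const char *env_or_default(const char *name, const char *dflt)
{
    const char *value = getenv(name);   /* NULL when name is not set */

    return value != NULL ? value : dflt;
}
```

A program might call env_or_default("EDITOR", "ed") to choose an editor; the variable name and default here are only examples.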



Hash Table Management 

The following functions are used to manage hash search tables. 
The header file associated with these functions should be 
included in the program being compiled. This is accomplished 
by including the line: 

#include <search.h> 

near the beginning of any file using the search functions. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

hcreate     hsearch(3C)     Create hash table. 
hdestroy    hsearch(3C)     Destroy hash table. 
hsearch     hsearch(3C)     Search hash table for entry. 
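
As an illustration, the three hash table functions can be used together as in the following sketch, which counts the distinct strings in a list. The function name count_distinct is hypothetical, and the code is written in ANSI C so that it compiles with current compilers.

```c
#include <search.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the distinct strings among words[0..n-1].
 * Returns -1 if the table cannot be created or an entry cannot be added.
 */
int count_distinct(char **words, int n)
{
    int i, distinct = 0;

    if (n <= 0)
        return 0;
    if (hcreate((size_t)n * 2) == 0)    /* room for every word, with slack */
        return -1;

    for (i = 0; i < n; i++) {
        ENTRY item;

        item.key = words[i];
        item.data = NULL;
        if (hsearch(item, FIND) == NULL) {          /* not seen before */
            if (hsearch(item, ENTER) == NULL) {     /* table is full */
                hdestroy();
                return -1;
            }
            distinct++;
        }
    }
    hdestroy();
    return distinct;
}
```

These functions manage a single table per process, so the sketch creates and destroys the table within one call. The table stores the key pointers rather than copies, so the strings must remain valid while the table is in use.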



Binary Tree Management 

The following functions are used to manage a binary tree. The 
header file associated with these functions should be included 
in the program being compiled. This is accomplished by 
including the line: 

#include <search.h> 

near the beginning of any file using the search functions. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

tdelete     tsearch(3C)     Deletes nodes from binary tree. 
tsearch     tsearch(3C)     Look for and add element to binary tree. 
twalk       tsearch(3C)     Walk binary tree. 



Table Management 

The following functions are used to manage a table. Since none 
of these functions allocate storage, sufficient memory must be 
allocated before using these functions. The header file 
associated with these functions should be included in the 
program being compiled. This is accomplished by including the 
line: 



#include <search.h> 



near the beginning of any file using the search functions. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

bsearch     bsearch(3C)     Search table using binary search. 
lsearch     lsearch(3C)     Look for and add element in table using 
                            linear search. 
qsort       qsort(3C)       Sort table using quick-sort algorithm. 
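
As an illustration, qsort and bsearch share the same form of comparison function, so a table sorted with one can be searched with the other. The names cmp_int and sort_and_find in the following sketch are hypothetical. The sketch includes <stdlib.h>, which declares both functions on current ANSI C systems.

```c
#include <stdlib.h>

/* Comparison function in the form qsort and bsearch expect. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    return (x > y) - (x < y);   /* negative, zero, or positive */
}

/*
 * Hypothetical helper: sort tab[0..n-1] in place, then report whether
 * key is present. Returns 1 if found and 0 otherwise.
 */
int sort_and_find(int *tab, size_t n, int key)
{
    qsort(tab, n, sizeof tab[0], cmp_int);
    return bsearch(&key, tab, n, sizeof tab[0], cmp_int) != NULL;
}
```

The caller supplies the table; as noted above, none of these functions allocate storage for it. The table must be sorted in the order defined by the comparison function before bsearch is used.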



Memory Allocation 

The following functions provide a means by which memory can 
be dynamically allocated or freed. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

calloc      malloc(3C)      Allocate zeroed storage. 
free        malloc(3C)      Free previously allocated storage. 
malloc      malloc(3C)      Allocate storage. 
realloc     malloc(3C)      Change size of allocated storage. 
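
As an illustration, the following sketch uses realloc to grow an array one element at a time. The function name append_int is hypothetical. The sketch relies on the ANSI C rule that realloc with a null pointer behaves like malloc; older libraries may not guarantee this.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: append value to a heap array of *count ints.
 * Pass arr == NULL and *count == 0 to start a new array. Returns the
 * (possibly moved) array, or NULL on failure, in which case the old
 * array is unchanged and still owned by the caller.
 */
int *append_int(int *arr, size_t *count, int value)
{
    int *grown = realloc(arr, (*count + 1) * sizeof *grown);

    if (grown == NULL)
        return NULL;
    grown[*count] = value;
    (*count)++;
    return grown;
}
```

The result of realloc is kept in a separate variable until it has been checked, so that the original pointer is not lost if the request fails. The caller releases the array with free when it is no longer needed.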



The following is another set of memory allocation functions 
available. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

calloc      malloc(3X)      Allocate zeroed storage. 
free        malloc(3X)      Free previously allocated storage. 
malloc      malloc(3X)      Allocate storage. 



Pseudorandom Number Generation 

The following functions are used to generate pseudorandom 
numbers. The functions that end with 48 are a family of 
interfaces to a pseudorandom number generator based upon the 
linear congruential algorithm and 48-bit integer arithmetic. The 
rand and srand functions provide an interface to a 
multiplicative congruential random number generator with 
period of 2^32. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

drand48     drand48(3C)     Random double over the interval [0 to 1). 
lcong48     drand48(3C)     Set parameters for drand48, lrand48, and 
                            mrand48. 
lrand48     drand48(3C)     Random long over the interval 
                            [0 to 2^31). 
mrand48     drand48(3C)     Random long over the interval 
                            [-2^31 to 2^31). 
rand        rand(3C)        Random integer over the interval 
                            [0 to 32767). 
seed48      drand48(3C)     Seed the generator for drand48, lrand48, 
                            and mrand48. 
srand       rand(3C)        Seed the generator for rand. 
srand48     drand48(3C)     Seed the generator for drand48, lrand48, 
                            and mrand48 using a long. 
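
As an illustration of seeding, the following sketch uses srand and rand to produce a sequence that can be reproduced from its seed. The function name roll_dice is hypothetical, and the 48-bit family is not shown. The sketch includes <stdlib.h>, which declares rand and srand on current ANSI C systems.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: fill out[0..n-1] with die rolls in 1..sides,
 * seeding rand with seed so that the sequence can be reproduced.
 */
void roll_dice(unsigned seed, int sides, int *out, int n)
{
    int i;

    srand(seed);                        /* same seed, same sequence */
    for (i = 0; i < n; i++)
        out[i] = rand() % sides + 1;    /* slightly biased for large sides */
}
```

Calling the function twice with the same seed yields the same rolls, which is useful when a test run must be repeated exactly.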



Signal Handling Functions 

The functions gsignal and ssignal implement a software 
facility similar to signal(2) in the AT&T UNIX System V 
Manual. This facility enables users to indicate the disposition 
of error conditions and allows users to handle signals for their 
own purposes. The declarations associated with these functions 
can be included in the program being compiled by the line: 

#include <signal.h> 

These declarations define ASCII names for the 15 software 
signals. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

gsignal     ssignal(3C)     Send a software signal. 
ssignal     ssignal(3C)     Arrange for handling of software signals. 



Miscellaneous 

The following functions do not fall into any previously 
described category. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

abort       abort(3C)       Cause an IOT signal to be sent to the 
                            process. 
abs         abs(3C)         Return the absolute integer value. 
ecvt        ecvt(3C)        Convert double to string. 
fcvt        ecvt(3C)        Convert double to string using Fortran 
                            Format. 
gcvt        ecvt(3C)        Convert double to string using Fortran 
                            F or E format. 
isatty      ttyname(3C)     Test whether integer file descriptor is 
                            associated with a terminal. 
mktemp      mktemp(3C)      Create file name using template. 
monitor     monitor(3C)     Cause process to record a histogram of 
                            program counter location. 
swab        swab(3C)        Swap and copy bytes. 
ttyname     ttyname(3C)     Return pathname of terminal associated 
                            with integer file descriptor. 



Chapter 4 
THE OBJECT AND MATH LIBRARIES 

GENERAL 

This chapter describes the Object and Math Libraries that are 
supported on the UNIX operating system. A library is a 
collection of related functions and/or declarations that simplify 
programming effort. All of the functions described are also 
described in Section 3 of the AT&T UNIX PC UNIX System V 
Manual. Most of the declarations described are in Section 5 of 
the AT&T UNIX PC UNIX System V Manual. The main 
libraries on the UNIX system are: 

C library 

    This is the basic library for C language programs. The C 
    library is composed of functions and declarations used for 
    file access, string testing and manipulation, character 
    testing and manipulation, memory allocation, and other 
    functions. This library is described in Chapter 3. 

Object file library 

    This library provides functions for the access and 
    manipulation of object files. This library is described 
    later in this chapter. 

Math library 

    This library provides exponential, Bessel, logarithmic, 
    hyperbolic, and trigonometric functions. This library is 
    also described later in this chapter. 

tam library 

    This library contains the AT&T UNIX PC "terminal access 
    method" (tam) functions. 



THE OBJECT FILE LIBRARY 

The object file library provides functions for the access and 
manipulation of object files. Some functions locate portions of 
an object file such as the symbol table, the file header, sections, 
and line number entries associated with a function. Other 
functions read these types of entries into memory. For a 
description of the format of an object file, see "The Common 
Object File Format" in Chapter 18. 

This library consists of several portions. The functions reside 
in /usr/lib/libld.a and are located and loaded during the 
compiling of a C language program by a command line request. 
The form of this request is: 

cc file -lld 



which causes the link editor to search the object file library. 
The argument -lld must appear after all files that reference 
functions in libld.a. 

In addition, various header files must be included. This is 
accomplished by including the lines: 

#include <stdio.h> 
#include <a.out.h> 
#include <ldfcn.h> 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

ldaclose    ldclose(3X)     Close object file being processed. 
ldahread    ldahread(3X)    Read archive header. 
ldaopen     ldopen(3X)      Open object file for reading. 
ldclose     ldclose(3X)     Close object file being processed. 
ldfhread    ldfhread(3X)    Read file header of object file being 
                            processed. 
ldgetname   ldgetname(3X)   Retrieve the name of an object file symbol 
                            table entry. 
ldlinit     ldlread(3X)     Prepare object file for reading line 
                            number entries via ldlitem. 
ldlitem     ldlread(3X)     Read line number entry from object file 
                            after ldlinit. 
ldlread     ldlread(3X)     Read line number entry from object file. 
ldlseek     ldlseek(3X)     Seeks to the line number entries of the 
                            object file being processed. 
ldnlseek    ldlseek(3X)     Seeks to the line number entries of the 
                            object file being processed given the 
                            name of a section. 
ldnrseek    ldrseek(3X)     Seeks to the relocation entries of the 
                            object file being processed given the 
                            name of a section. 
ldnshread   ldshread(3X)    Read section header of the named section 
                            of the object file being processed. 
ldnsseek    ldsseek(3X)     Seeks to the section of the object file 
                            being processed given the name of a 
                            section. 
ldohseek    ldohseek(3X)    Seeks to the optional file header of the 
                            object file being processed. 
ldopen      ldopen(3X)      Open object file for reading. 
ldrseek     ldrseek(3X)     Seeks to the relocation entries of the 
                            object file being processed. 
ldshread    ldshread(3X)    Read section header of an object file 
                            being processed. 
ldsseek     ldsseek(3X)     Seeks to the section of the object file 
                            being processed. 
ldtbindex   ldtbindex(3X)   Returns the long index of the symbol table 
                            entry at the current position of the 
                            object file being processed. 
ldtbread    ldtbread(3X)    Reads a specific symbol table entry of the 
                            object file being processed. 
ldtbseek    ldtbseek(3X)    Seeks to the symbol table of the object 
                            file being processed. 
sgetl       sputl(3X)       Access long integer data in a machine 
                            independent format. 
sputl       sputl(3X)       Translate a long integer into a machine 
                            independent format. 



Common Object File Interface Macros (ldfcn.h) 

The interface between the calling program and the object file 
access routines is based on the defined type LDFILE which is 
defined in the header file ldfcn.h (see ldfcn(4)). The primary 
purpose of this structure is to provide uniform access to both 
simple object files and to object files that are members of an 
archive file. 



The function ldopen(3X) allocates and initializes the LDFILE 
structure and returns a pointer to the structure to the calling 
program. The fields of the LDFILE structure may be accessed 
individually through the following macros. The TYPE macro 
returns the magic number of the file, which is used to 
distinguish between archive files and simple object files. The 
IOPTR macro returns the file pointer which was opened by 
ldopen(3X) and is used by the input/output functions of the C 
library. The OFFSET macro returns the file address of the 
beginning of the object file. This value is non-zero only if the 
object file is a member of the archive file. The HEADER 
macro accesses the file header structure of the object file. 

Additional macros are provided to access an object file. These 
macros parallel the input/output functions in the C library; 
each macro translates a reference to an LDFILE structure 
into a reference to its file descriptor field. The available 
macros are described in ldfcn(4) in the AT&T UNIX System V 
Manual. 



THE MATH LIBRARY 

The math library consists of functions and a header file. The 
functions are located and loaded during the compiling of a C 
language program by a command line request. The form of this 
request is: 

cc file -lm 



which causes the link editor to search the math library. In 
addition to the request to load the functions, the header file of 
the math library should be included in the program being 
compiled. This is accomplished by including the line: 

#include <math.h> 

near the beginning of the (first) file being compiled. 

The functions are grouped into the following categories: 

• Trigonometric functions 

• Bessel functions 

• Hyperbolic functions 

• Miscellaneous functions. 

Trigonometric Functions 

These functions are used to compute angles (in radian 
measure), sines, cosines, and tangents. All of these values are 
expressed in double precision. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

acos        trig(3M)        Return arc cosine. 
asin        trig(3M)        Return arc sine. 
atan        trig(3M)        Return arc tangent. 
atan2       trig(3M)        Return arc tangent of a ratio. 
cos         trig(3M)        Return cosine. 
sin         trig(3M)        Return sine. 
tan         trig(3M)        Return tangent. 



Bessel Functions 

These functions calculate Bessel functions of the first and 
second kinds of several orders for real values. The Bessel 
functions are j0, j1, jn, y0, y1, and yn. The functions are 
described in bessel(3M). 



Hyperbolic Functions 

These functions are used to compute the hyperbolic sine, cosine, 
and tangent for real values. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

cosh        sinh(3M)        Return hyperbolic cosine. 
sinh        sinh(3M)        Return hyperbolic sine. 
tanh        sinh(3M)        Return hyperbolic tangent. 



Miscellaneous Functions 

These functions cover a wide variety of operations, such as 
natural logarithm, exponential, and absolute value. In addition, 
several are provided to truncate the integer portion of double 
precision numbers. 



FUNCTION    REFERENCE       BRIEF DESCRIPTION 

ceil        floor(3M)       Returns the smallest integer not less than 
                            a given value. 
exp         exp(3M)         Returns the exponential function of a 
                            given value. 
fabs        floor(3M)       Returns the absolute value of a given 
                            value. 
floor       floor(3M)       Returns the largest integer not greater 
                            than a given value. 
fmod        floor(3M)       Returns the remainder produced by the 
                            division of two given values. 
gamma       gamma(3M)       Returns the natural log of the absolute 
                            value of the result of applying the gamma 
                            function to a given value. 
hypot       hypot(3M)       Returns the square root of the sum of the 
                            squares of two numbers. 
log         exp(3M)         Returns the natural logarithm of a given 
                            value. 
log10       exp(3M)         Returns the logarithm base ten of a given 
                            value. 
matherr     matherr(3M)     Error-handling function. 
pow         exp(3M)         Returns the result of a given value raised 
                            to another given value. 
sqrt        exp(3M)         Returns the square root of a given value. 
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Chapter 5 
COMPILER AND C LANGUAGE 



This chapter describes the UNIX System's C compiler, cc, and 
the C programming language that the compiler translates. The 
compiler is part of the UNIX System Software Generation 
System (SGS). 

The SGS is a package of tools used to create and test programs 
for UNIX Systems. These tools allow high-level program 
coding and source-level testing of code. The C language is 
implemented for high-level programming; it contains many 
control and structuring facilities that greatly simplify the task 
of algorithm construction. Within the SGS, a C compiler 
converts C programs into assembly language programs that are 
ultimately translated into object files by the assembler, as. 
The link editor, ld, collects and merges object files into
executable load modules. Each of these tools preserves all 
symbolic information necessary for meaningful symbolic testing 
at C-language source level. In addition, a utility package aids 
in testing and debugging. 



USE OF THE COMPILER 

The main command of the SGS is cc; it operates much like the 
UNIX system cc command. To use the compiler, first create a 
file (typically by using the UNIX system text editor) containing 
C source code. The name of the file created must have a special
format; the last two characters of the file name must be .c, as
in file1.c.

Next, enter the SGS command

cc options file.c

to invoke the compiler on the C source file file.c with the 
appropriate options selected. The compilation process creates 
an absolute binary file named a.out that reflects the contents 
of file.c and any referenced library routines. The resulting 
binary file, a.out, can then be executed on the target system. 

Options can control the steps in the compilation process. When 
none of the controlling options are used, and only one file is 
named, cc automatically calls the assembler, as, and the link
editor, ld, thus resulting in an executable file named a.out. If
more than one file is named in a command,

cc file1.c file2.c file3.c

then the output will be placed on files file1.o, file2.o, and file3.o.
These files can then be linked and executed through the ld
command.

The cc compiler also accepts input file names with the last two 
characters .s. The .s signifies a source file in assembly 
language. The cc compiler passes this type of file directly to 
as, which assembles the file and places the output on a file of 
the same name with .o substituted for .s. 

Cc is based on a portable C compiler and translates C source 
files into assembly code. Whenever the command cc is used, 
the standard C preprocessor (which resides on the file /lib/cpp) 
is called. The preprocessor performs file inclusion and macro 
substitution. The preprocessor is always invoked by cc and 
need not be called directly by the programmer. Then, unless 
the appropriate flags are set, cc calls the assembler and the 
link editor to produce an executable file.



COMPILER OPTIONS

All options recognized by the cc command are listed below: 



Option       Argument           Description

(illegible)  none               Display without executing each
                                command that cc generates.

-c           none               Suppress the link-editing phase
                                of compilation and force an
                                object file to be produced
                                even if only one file is
                                compiled.

(illegible)  none               Arrange for the compiler to produce
                                code which counts the number
                                of times each routine is called;
                                also, if link editing takes
                                place, replace the standard
                                startoff routine by one which
                                automatically calls monitor(3C)
                                at the start and arranges
                                to write out a mon.out file
                                at normal termination of
                                execution of the object program.
                                An execution profile can be
                                generated by use of prof(1).

(illegible)  none               Link the object program with the
                                floating-point interpreter
                                for systems without
                                hardware floating-point.

-g           none               Cause the compiler to generate
                                additional information needed
                                for the use of sdb(1).
                                This flag and -O
                                (described below) are mutually
                                exclusive; -g takes precedence
                                when both are specified.

-O           none               Invoke an object-code
                                optimizer. This flag and -g
                                (described above) are mutually
                                exclusive; -g takes precedence
                                when both are specified.

-S           none               Compile the named C programs
                                and leave the assembler
                                language output on corresponding
                                files suffixed .s.

-E           none               Run only cpp(1)
                                on the named C programs
                                and send the result to
                                standard output.

-P           none               Run only cpp(1) on
                                the named C programs,
                                and leave the result on
                                corresponding files suffixed .i.

-B           string             Construct pathnames
                                for substitute compiler,
                                assembler and link editor
                                passes by concatenating
                                string with the
                                suffixes cpp, c1, c2, as
                                and ld. If string is
                                empty it is taken to be /lib/o.

-t           [p012al]           Find only the
                                designated compiler,
                                assembler and link editor
                                passes in the files whose
                                names are constructed by
                                a -B option. In the absence
                                of a -B option, the string
                                is taken to be /lib/n. The value
                                -t "" is equivalent to -tp012.

-W           c,arg1[,arg2...]   Hand off the argument(s) argi
                                to pass c, where c is one of
                                [p012al], indicating preprocessor,
                                compiler first pass, compiler second
                                pass, optimizer, assembler, or link
                                editor, respectively.

-d           none               This option is no longer
                                allowed because of a conflict of
                                meaning. The -W option must be used
                                to specify precisely its destination.
                                To indicate the -dn option for the
                                VAX assembler, use -Wa,-dn. To
                                indicate the -d option for the link
                                editor, use -Wl,-d.



This part provides additional information for those options not 
completely described above. 

By using appropriate options, compilation can be terminated 
early to produce one of several intermediate translations such 
as relocatable object files (-c option), assembly source 
expansions for C code (-S option), or the output of the 
preprocessor (-P option). In general, the intermediate files 
may be saved and later resubmitted to the cc command, with 
other files or libraries included as necessary. 

When compiling C source files, the most common practice is to
use the -c option to save relocatable files. Subsequent changes
to one file do not then require that the others be recompiled. A
separate call to cc without the -c option then creates the linked
executable a.out file. A relocatable object file created under
the -c option is named by substituting .o for the .c suffix of
the source file name.

The -W option provides the mechanism to specify options for 
each step that is normally invoked from the cc command line. 
These steps are preprocessing, the first pass of the compiler, 
the second pass of the compiler, optimization, assembly, and 
link editing. At this time, only assembler and link editor 
options can be used with the -W option. 

When the -P option is used, the compilation process stops after 
only preprocessing, with output left on file.i. This file will be 
unsuitable for subsequent processing by cc. 
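To make the preprocessing step more concrete, the fragment below sketches the two jobs the preprocessor performs, file inclusion and macro substitution. The macro names BUFFER_SIZE and SQUARE and the function square_plus_buffer are invented for this illustration; the comments describe what the expanded text would be expected to look like, not the literal contents of any particular file.i.

```c
#include <stdio.h>              /* file inclusion: the header text is copied in */

#define BUFFER_SIZE 512         /* simple macro: a name replaced by a constant  */
#define SQUARE(x) ((x) * (x))   /* macro with an argument                       */

/* After macro substitution the return statement reads
   return ((n) * (n)) + 512;
   which is the text the later compiler passes actually see. */
int square_plus_buffer(int n)
{
    return SQUARE(n) + BUFFER_SIZE;
}
```

The parentheses in SQUARE are a common precaution: because substitution is purely textual, SQUARE(a + 1) would otherwise expand to a + 1 * a + 1.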

The -O option decreases the size and increases the execution 
speed of programs by moving, merging, and deleting code. 

The -g option produces information for a symbolic debugger.
The SGS currently supports the sdb symbolic debugger.



Chapter 6
A C PROGRAM CHECKER - "lint"

GENERAL 

The lint program examines C language source programs 
detecting a number of bugs and obscurities. It enforces the 
type rules of C language more strictly than the C compiler. It 
may also be used to enforce a number of portability restrictions 
involved in moving programs between different machines 
and/or operating systems. Another option detects a number of 
wasteful or error prone constructions which nevertheless are 
legal. The lint program accepts multiple input files and library 
specifications and checks them for consistency. 

Usage 

The lint command has the form: 

lint [options] files ... library-descriptors ... 

where options are optional flags to control lint checking and
messages; files are the files to be checked, which end with .c or
.ln; and library-descriptors are the names of libraries to be used
in checking the program.

The options that are currently supported by the lint command 
are: 

-a        Suppress messages about assignments of long
          values to variables that are not long.

-b        Suppress messages about break statements that
          cannot be reached.

-c        Only check for intra-file bugs; leave external
          information in files suffixed with .ln.

-h        Do not apply heuristics (which attempt to detect
          bugs, improve style, and reduce waste).

-n        Do not check for compatibility with either the
          standard or the portable lint library.

-o name   Create a lint library named llib-lname.ln from
          the input files.

-p        Attempt to check portability to other dialects of C
          language.

-u        Suppress messages about functions and external
          variables used and not defined or defined and not
          used.

-v        Suppress messages about unused arguments in
          functions.

-x        Do not report variables referred to by external
          declarations but never used.



When more than one option is used, they should be combined
into a single argument, such as -ab or -xha.

The names of files that contain C language programs should 
end with the suffix .c which is mandatory for lint and the C 
compiler. 

The lint program accepts certain arguments, such as: 

-ly 

These arguments specify libraries that contain functions used
in the C language program. The source code is tested for
compatibility with these libraries. This is done by accessing
library description files whose names are constructed from the
library arguments. These files all begin with the comment:

/* LINTLIBRARY */ 

which is followed by a series of dummy function definitions. 
The critical parts of these definitions are the declaration of the 
function return type, whether the dummy function returns a 
value, and the number and types of arguments to the function. 
The VARARGS and ARGSUSED comments can be used to 
specify features of the library functions. 

The lint library files are processed almost exactly like ordinary 
source files. The only difference is that functions which are 
defined on a library file but are not used on a source file do not 
result in messages. The lint program does not simulate a full 
library search algorithm and will print messages if the source 
files contain a redefinition of a library routine. 

By default, lint checks the programs it is given against a 
standard library file which contains descriptions of the 
programs which are normally loaded when a C language 
program is run. When the -p option is used, another file is
checked containing descriptions of the standard library routines
which are expected to be portable across various machines. The
-n option can be used to suppress all library checking.



TYPES OF MESSAGES 

The following paragraphs describe the major categories of
messages printed by lint.

Unused Variables and Functions

As sets of programs evolve and develop, previously used 
variables and arguments to functions may become unused. It is 
not uncommon for external variables or even entire functions to 
become unnecessary and yet not be removed from the source. 
These types of errors rarely cause working programs to fail, but 
are a source of inefficiency and make programs harder to 
understand and change. Also, information about such unused 
variables and functions can occasionally serve to discover bugs. 

The lint program prints messages about variables and 
functions which are defined but not otherwise mentioned. An 
exception is variables which are declared through explicit 
extern statements but are never referenced; thus the 
statement 



extern double sin( ); 

will evoke no comment if sin is never used. Note that this
agrees with the semantics of the C compiler. In some cases,
these unused external declarations might be of some interest
and can be discovered by using the -x option with the lint
command.

Certain styles of programming require many functions to be
written with similar interfaces; frequently, some of the
arguments may be unused in many of the calls. The -v option
is available to suppress the printing of messages about unused
arguments. When -v is in effect, no messages are produced
about unused arguments except for those arguments which are
unused and also declared as register arguments. This can be
considered an active (and preventable) waste of the register
resources of the machine.

Messages about unused arguments can be suppressed for one
function by adding the comment:

/* ARGSUSED */

to the program before the function. This has the effect of the
-v option for only one function. Also, the comment:

/* VARARGS */

can be used to suppress messages about variable number of
arguments in calls to a function. The comment should be added
before the function definition. In some cases, it is desirable to
check the first several arguments and leave the later arguments
unchecked. This can be done with a digit giving the number of
arguments which should be checked. For example:

/* VARARGS2 */

will cause only the first two arguments to be checked. 
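The placement of these comments can be sketched as follows. Both functions, default_handler and sum_ints, are invented for this illustration. The variable-argument function is written with the <stdarg.h> macros so that it compiles with a current C compiler; that header is an assumption of this sketch and is not described in this guide.

```c
#include <stdarg.h>

/* ARGSUSED */
int default_handler(int event, void *context)
{
    /* Neither argument is needed here, but the interface is assumed to
       match other handlers that do use them. */
    return 0;
}

/* VARARGS1 */
int sum_ints(int count, ...)
{
    /* Only the first argument, count, has a fixed type; the remaining
       count arguments are read as ints. */
    va_list ap;
    int total = 0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

With the digit 1 in the VARARGS comment, a checker following the convention described above would verify only the count argument of each call to sum_ints.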

There is one case where information about unused or undefined 
variables is more distracting than helpful. This is when lint is 
applied to some but not all files out of a collection which are to 
be loaded together. In this case, many of the functions and 
variables defined may not be used. Conversely, many functions 
and variables defined elsewhere may be used. The -u option
may be used to suppress the spurious messages which might
otherwise appear.

Set/Used Information 

The lint program attempts to detect cases where a variable is 
used before it is set. The lint program detects local variables 
(automatic and register storage classes) whose first use appears 
physically earlier in the input file than the first assignment to 
the variable. It assumes that taking the address of a variable 
constitutes a "use", since the actual use may occur at any later 
time, in a data dependent fashion.

The restriction to the physical appearance of variables in the
file makes the algorithm very simple and quick to implement 
since the true flow of control need not be discovered. It does 
mean that lint can print messages about some programs which 
are legal, but these programs would probably be considered bad 
on stylistic grounds. Because static and external variables are 
initialized to zero, no meaningful information can be discovered 
about their uses. The lint program does deal with initialized 
automatic variables. 

The set/used information also permits recognition of those local 
variables which are set and never used. These form a frequent 
source of inefficiencies and may also be symptomatic of bugs. 
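A small sketch of the set-but-never-used case follows. The function sum_to and its variable names are invented for this illustration, and the comments describe what a checker of this kind would be expected to notice rather than quoting any actual message text.

```c
/* Returns the sum of the integers 1 through n. */
int sum_to(int n)
{
    int i;
    int total = 0;    /* initialized automatic variable: set before any use */
    int last;         /* assigned in the loop but never read afterwards     */

    for (i = 1; i <= n; i++) {
        total += i;
        last = i;     /* a set with no later use: wasted work, possibly a bug */
    }
    return total;
}
```

Here the assignments to last do no harm, but they may hint that the author meant to return or test the value and forgot to.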



Flow of Control 

The lint program attempts to detect unreachable portions of 
the programs which it processes. It will print messages about 
unlabeled statements immediately following goto, break, 
continue, or return statements. An attempt is made to 
detect loops which can never be left at the bottom and to 
recognize the special cases while(1) and for(;;) as infinite
loops. The lint program also prints messages about loops 
which cannot be entered at the top. Some valid programs may 
have such loops which are considered to be bad style at best 
and bugs at worst. 

The lint program has no way of detecting functions which are
called and never return. Thus, a call to exit may cause
unreachable code which lint does not detect. The most serious
effects of this are in the determination of returned function
values (see "Function Values"). If a particular place in the
program cannot be reached but it is not apparent to lint, the
comment

/* NOTREACHED */

can be added at the appropriate place. This comment will
inform lint that a portion of the program cannot be reached. 
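As a sketch of where the comment would go, consider a function whose last statement calls a routine that never returns. The names fatal and checked_divide are invented for this example. A present-day compiler may warn that control reaches the end of checked_divide, for the same reason lint would without the comment.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper that prints a message and never returns. */
void fatal(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(1);
}

/* Returns num / den; a zero divisor is treated as a fatal error. */
int checked_divide(int num, int den)
{
    if (den != 0)
        return num / den;
    fatal("division by zero");
    /* NOTREACHED */
}
```

The comment marks the end of the function as unreachable, so the apparent fall-through without a return value is not reported.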

The lint program will not print a message about unreachable
break statements. Programs generated by yacc and
especially lex may have hundreds of unreachable break
statements. The -O option in the C compiler will often
eliminate the resulting object code inefficiency. Thus, these
unreached statements are of little importance. There is
typically nothing the user can do about them, and the resulting
messages would clutter up the lint output. If these messages
are desired, lint can be invoked with the -b option.



Function Values 

Sometimes functions return values that are never used. 
Sometimes programs incorrectly use function "values" that
have never been returned. The lint program addresses this 
problem in a number of ways. 

Locally, within a function definition, the appearance of both 

return( expr ); 
and 

return ; 

statements is cause for alarm; the lint program will give the 
message 

function name contains return(e) and return 

The most serious difficulty with this is detecting when a
function return is implied by flow of control reaching the end
of the function. This can be seen with a simple example:

f(a) {
     if ( a ) return ( 3 );
     g();
}

Notice that, if a tests false, f will call g and then return with
no defined return value; this will trigger a message from lint.
If g, like exit, never returns, the message will still be produced
when in fact nothing is wrong.

In practice, some potentially serious bugs have been discovered 
by this feature. 
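One way to remove the ambiguity in a function like the f of the example is to return a value on every path. The version below is a sketch only: the stand-in g, its call counter g_calls, and the choice of 0 as the fall-through value are assumptions made for the illustration.

```c
/* Hypothetical stand-in for the g() of the text; it only counts calls. */
static int g_calls = 0;

void g(void)
{
    g_calls++;
}

/* Every path through f now ends in a return with a value. */
int f(int a)
{
    if (a)
        return 3;
    g();
    return 0;    /* explicit value where the original fell off the end */
}
```

With this form there is no mixture of return(e) and plain return, and callers that use the result always receive a defined value.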

On a global scale, lint detects cases where a function returns a 
value that is sometimes or never used. When the value is never 
used, it may constitute an inefficiency in the function 
definition. When the value is sometimes unused, it may 
represent bad style (e.g., not testing for error conditions). 

The dual problem, using a function value when the function 
does not return one, is also detected. This is a serious problem. 



Type Checking 

The lint program enforces the type checking rules of C 
language more strictly than the compilers do. The additional 
checking is in four major areas: 

• Across certain binary operators and implied assignments 

• At the structure selection operators 

• Between the definition and uses of functions 

• In the use of enumerations.

There are a number of operators which have an implied
balancing between types of the operands. The assignment, 
conditional ( ?: ), and relational operators have this property. 
The argument of a return statement and expressions used in 
initialization suffer similar conversions. In these operations, 
char, short, int, long, unsigned, float, and double types 
may be freely intermixed. The types of pointers must agree 
exactly except that arrays of x's can, of course, be intermixed 
with pointers to x's. 

The type checking rules also require that, in structure
references, the left operand of the -> be a pointer to structure,
the left operand of the . be a structure, and the right operand
of these operators be a member of the structure implied by the
left operand. Similar checking is done for references to unions.

Strict rules apply to function argument and return value 
matching. The types float and double may be freely matched, 
as may the types char, short, int, and unsigned. Also, 
pointers can be matched with the associated arrays. Aside 
from this, all actual arguments must agree in type with their 
declared counterparts. 

With enumerations, checks are made that enumeration 
variables or members are not mixed with other types or other 
enumerations and that the only operations applied are =, 
initialization, ==, !=, and function arguments and return 
values. 

If it is desired to turn off strict type checking for an expression, 
the comment 

/* NOSTRICT */ 

should be added to the program immediately before the
expression. This comment will prevent strict type checking for
only the next line in the program.



Type Casts

The type cast feature in C language was introduced largely as
an aid to producing more portable programs. Consider the
assignment

p = 1;

where p is a character pointer. The lint program will print a
message as a result of detecting this. Consider the assignment

p = (char *)1;

in which a cast has been used to convert the integer to a
character pointer. The programmer obviously had a strong
motivation for doing this and has clearly signaled that
intention. It seems harsh for lint to continue to print
messages about this. On the other hand, if this code is moved
to another machine, such code should be looked at carefully.
The -c flag controls the printing of comments about casts.
When -c is in effect, casts are treated as though they were
assignments subject to messages; otherwise, all legal casts are
passed without comment, no matter how strange the type
mixing seems to be.



Nonportable Character Use 

On some systems, characters are signed quantities with a range 
from -128 to 127. On other C language implementations, 
characters take on only positive values. Thus, lint will print 
messages about certain comparisons and assignments as being 
illegal or nonportable. For example, the fragment 

char c;

if( (c = getchar( )) < 0 ) . . .

will work on one machine but will fail on machines where
characters always take on positive values. The real solution is
to declare c as an integer since getchar is actually returning
integer values. In any case, lint will print the message
"nonportable character comparison".

A similar issue arises with bit fields. When assignments of 
constant values are made to bit fields, the field may be too 
small to hold the value. This is especially true because on some 
machines bit fields are considered as signed quantities. While 
it may seem logical to consider that a two-bit field declared of 
type int cannot hold the value 3, the problem disappears if the 
bit field is declared to have type unsigned. 
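Both portability problems in this section can be sketched in a few lines. The function and structure names below are invented for the illustration. The first pair of functions shows why an end-of-file value must be kept in an int; the last shows that an unsigned two-bit field holds the value 3.

```c
#include <stdio.h>

/* EOF is a negative int. Stored in a character that takes only
   positive values, it can no longer be recognized. */
int eof_seen_via_unsigned_char(void)
{
    unsigned char c = (unsigned char)EOF;
    return c == EOF;     /* c promotes to a non-negative int: always 0 */
}

/* Stored in an int, the value survives and the test works everywhere. */
int eof_seen_via_int(void)
{
    int c = EOF;
    return c == EOF;     /* 1 */
}

struct flags {
    unsigned int mode : 2;   /* unsigned two-bit field: holds 0 through 3 */
};

/* Stores value in the two-bit field and returns what the field holds. */
int store_mode(int value)
{
    struct flags f;

    f.mode = value;      /* values outside 0..3 are reduced modulo 4 */
    return f.mode;
}
```

If mode were instead declared with type int, a machine that treats bit fields as signed could leave the field unable to represent 3, which is the situation the paragraph above describes.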



Strange Constructions 

Several perfectly legal, but somewhat strange, constructions are
detected by lint. The messages are intended to encourage better
code quality and clearer style, and may even point out bugs. The -h
option is used to suppress these checks. For example, in the
statement

*p++;

the * does nothing. This provokes the message "null effect"
from lint. The following program fragment:

unsigned x;
if ( x < 0 ) . . .

results in a test that will never succeed. Similarly, the test

if ( x > 0 ) . . .

is equivalent to

if ( x != 0 ) . . .

which may not be the intended action. The lint program will
print the message "degenerate unsigned comparison" in these
cases. If a program contains something similar to

if ( 1 != 0 ) . . .

lint will print the message "constant in conditional context"
since the comparison of 1 with 0 gives a constant result.

Another construction detected by lint involves operator
precedence. Bugs which arise from misunderstandings about
the precedence of operators can be accentuated by spacing and
formatting, making such bugs extremely hard to find. For
example, the statements

if ( x&077 == 0 ) . . .

or

x<<2 + 40

probably do not do what was intended. The best solution is to 
parenthesize such expressions, and lint encourages this by an 
appropriate message. 
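The unsigned comparison and precedence cases in this section can be checked with a short sketch. All four function names are invented for the illustration; low_bits_clear_wrong deliberately keeps the unparenthesized form so that its behavior can be compared with the corrected one.

```c
/* For an unsigned value, x > 0 is the same test as x != 0. */
int is_positive(unsigned x)
{
    return x > 0;
}

/* Unparenthesized: == binds more tightly than &, so this is read as
   x & (077 == 0), that is x & 0, which is 0 for every x. */
int low_bits_clear_wrong(unsigned x)
{
    return x & 077 == 0;
}

/* Parenthesized: true when the low six bits of x are all zero. */
int low_bits_clear(unsigned x)
{
    return (x & 077) == 0;
}

/* Parenthesized shift. Without the parentheses, x << 2 + 40 would be
   read as x << 42, because + binds more tightly than <<. */
unsigned shift_then_add(unsigned x)
{
    return (x << 2) + 40;
}
```

For example, low_bits_clear(0100) is 1 while low_bits_clear_wrong(0100) is 0, and shift_then_add(1) is 44.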

Finally, when the -h option has not been used, lint prints
messages about variables which are redeclared in inner blocks
in a way that conflicts with their use in outer blocks. This is
legal but is considered to be bad style, usually unnecessary,
and frequently a bug.

Old Syntax

Several forms of older syntax are now illegal. These fall into 
two classes, assignment operators and initialization. 

The older forms of assignment operators (e.g., =+, =-, ...)
could cause ambiguous expressions, such as:

a =-1 ; 

which could be taken as either 

a =- 1 ; 



or 



a = -1 ; 

The situation is especially perplexing if this kind of ambiguity 
arises as the result of a macro substitution. The newer and 
preferred operators (e.g., +=, -=, ...) have no such
ambiguities. To encourage the abandonment of the older forms, 
lint prints messages about these old-fashioned operators. 
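The difference between the two readings can be seen with a compiler that no longer recognizes the old operators. The sketch below assumes such a compiler, under which the characters =- are read as an assignment followed by a unary minus; the function names are invented for the illustration.

```c
/* With the old operators gone, "a =- 1" is read as "a = -1". */
int assign_minus_one(int a)
{
    a =- 1;
    return a;
}

/* The newer compound operator states the decrement unambiguously. */
int decrement(int a)
{
    a -= 1;
    return a;
}
```

Starting from 5, the first function yields -1 and the second yields 4, which is exactly the pair of meanings the ambiguous expression could have had.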

A similar issue arises with initialization. The older language 
allowed 

int x 1 ;

to initialize x to 1. This also caused syntactic difficulties. For
example, the initialization

int x ( -1 ) ;

looks somewhat like the beginning of a function definition:

int x ( y ) { . . .

and the compiler must read past x in order to determine the
correct meaning. Again, the problem is even more perplexing
when the initializer involves a macro. The current syntax
places an equals sign between the variable and the initializer:

int x = -1 ;

This is free of any possible syntactic ambiguity. 

Pointer Alignment 

Certain pointer assignments may be reasonable on some 
machines and illegal on others due entirely to alignment 
restrictions. The lint program tries to detect cases where 
pointers are assigned to other pointers and such alignment 
problems might arise. The message "possible pointer alignment 
problem" results from this situation. 
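One common way to avoid the alignment question altogether is to copy bytes instead of converting a character pointer to a pointer to a wider type. The sketch below uses memcpy for this; the helper names read_int_at and write_int_at are invented for the illustration, and the approach is offered as one option rather than as what lint itself recommends.

```c
#include <string.h>

/* Reads an int stored at an arbitrary byte offset in buf. A cast such as
   *(int *)(buf + offset) could fault on machines that require ints to be
   aligned; copying the bytes makes no alignment assumption. */
int read_int_at(const char *buf, size_t offset)
{
    int value;

    memcpy(&value, buf + offset, sizeof value);
    return value;
}

/* Stores an int at an arbitrary byte offset in buf, again by copying. */
void write_int_at(char *buf, size_t offset, int value)
{
    memcpy(buf + offset, &value, sizeof value);
}
```

A value written with write_int_at at an odd offset can be read back with read_int_at on any machine, at the cost of a small copy.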

Multiple Uses and Side Effects 

In complicated expressions, the best order in which to evaluate 
subexpressions may be highly machine dependent. For 
example, on machines (like the PDP-11) in which the stack runs 
backwards, function arguments will probably be best evaluated 
from right to left. On machines with a stack running forward, 
left to right seems most attractive. Function calls embedded as 
arguments of other functions may or may not be treated 
similarly to ordinary arguments. Similar issues arise with 
other operators which have side effects, such as the assignment 
operators and the increment and decrement operators. 

In order that the efficiency of C language on a particular 
machine not be unduly compromised, the C language leaves the
order of evaluation of complicated expressions up to the local
compiler. In fact, the various C compilers have considerable 
differences in the order in which they will evaluate complicated 
expressions. In particular, if any variable is changed by a side 
effect and also used elsewhere in the same expression, the 
result is explicitly undefined. 
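The usual remedy is to separate the use of a variable from the side effect that changes it. The sketch below does this for an array copy that advances an index; the function name copy_and_advance is invented for the illustration.

```c
/* Copies b[i] into a[i], then advances i, and returns the new index.
   The use of i and the change to i are in separate statements, so the
   result does not depend on the compiler's order of evaluation. */
int copy_and_advance(int a[], const int b[], int i)
{
    a[i] = b[i];
    i++;
    return i;
}
```

Written this way, the same element is copied on every machine, whichever order the local compiler prefers for evaluating subexpressions.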

The lint program checks for the important special case where a 
simple scalar variable is affected. For example, the statement 

a[i] = b[i++];

will cause lint to print the message

warning: i evaluation order undefined

in order to call attention to this condition.



Chapter 7

SYMBOLIC DEBUGGING
PROGRAM - "sdb"



GENERAL 

This chapter describes the symbolic debugger sdb(1) as
implemented for C language programs on the UNIX operating
system. The sdb program is useful both for examining "core
images" of aborted programs and for providing an environment
in which execution of a program can be monitored and
controlled.



The sdb program allows interaction with a debugged program 
at the source language level. When debugging a core image 
from an aborted program, sdb reports which line in the source 
program caused the error and allows all variables to be 
accessed symbolically and displayed in the correct format. 

Breakpoints may be placed at selected statements or the 
program may be single stepped on a line-by-line basis. To 
facilitate specification of lines in the program without a source 
listing, sdb provides a mechanism for examining the source 
text. Procedures may be called directly from the debugger. 
This feature is useful both for testing individual procedures and 
for calling user-provided routines which provide formatted
printout of structured data. 



USAGE 

In order to use the full capabilities of sdb, it is necessary to
compile the source program with the -g option. This causes
the compiler to generate additional information about the
variables and statements of the compiled program. When the
-g option has been specified, sdb can be used to obtain a trace
of the called functions at the time of the abort and interactively
display the values of variables.

A typical sequence of shell commands for debugging a core 
image is 

$ cc -g prgm.c -o prgm 

$ prgm 

Bus error - core dumped 

$ sdb prgm 

main:25: x[i] = 0; 



The program prgm was compiled with the -g option and then
executed. An error occurred which caused a core dump. The 
sdb program is then invoked to examine the core dump to 
determine the cause of the error. It reports that the bus error 
occurred in function main at line 25 (line numbers are always 
relative to the beginning of the file) and outputs the source text 
of the offending line. The sdb program then prompts the user 
with an * indicating that it awaits a command. 

It is useful to know that sdb has a notion of current function 
and current line. In this example, they are initially set to main 
and "25", respectively. 

In the above example, sdb was called with one argument,
prgm. In general, it takes three arguments on the command
line. The first is the name of the executable file which is to be 
debugged; it defaults to a.out when not specified. The second is 
the name of the core file, defaulting to core; and the third is 
the name of the directory containing the source of the program 
being debugged. The sdb program currently requires all source 
to reside in a single directory. The default is the working 
directory. In the example, the second and third arguments
defaulted to the correct values, so only the first was specified.

It is possible that the error occurred in a function which was
not compiled with the -g option. In this case, sdb prints the
function name and the address at which the error occurred.
The current line and function are set to the first executable line
in main. The sdb program will print an error message if main
was not compiled with the -g option, but debugging can
continue for those routines compiled with the -g option.
Figure 7-1 shows a typical example of sdb usage.



Printing a Stack Trace 

It is often useful to obtain a listing of the function calls which 
led to the error. This is obtained with the t command. For 
example: 

*t 

sub(x=2,y=3) [prgm.c:25] 

inter(i=16012) [prgm.c:96] 

main(argc=1,argv=0x7fffff54,envp=0x7fffff5c) [prgm.c:15]

This indicates that the error occurred within the function sub 
at line 25 in file prgm.c. The sub function was called with the 
arguments x=2 and y=3 from inter at line 96. The inter 
function was called from main at line 15. The main function is 
always called by the shell with three arguments often referred 
to as argc, argv, and envp. Note that argv and envp are 
pointers, so their values are printed in hexadecimal. 



Examining Variables 

The sdb program can be used to display variables in the 
stopped program. Variables are displayed by typing their name 
followed by a slash, so 



*errflag/

causes sdb to display the value of variable errflag. Unless
otherwise specified, variables are assumed to be either local to 
or accessible from the current function. To specify a different 
function, use the form 

*sub:i/ 



to display variable i in function sub. F77 users can specify a
common block variable in the same manner. 



The sdb program supports a limited form of pattern matching 
for variable and function names. The symbol * is used to 
match any sequence of characters of a variable name and ? to 
match any single character. Consider the following commands 



*x*/
*sub:y?/
**/



The first prints the values of all variables beginning with x, the 
second prints the values of all two letter variables in function 
sub beginning with y, and the last prints all variables. In the 
first and last examples, only variables accessible from the 
current function are printed. The command

**:*/

displays the variables for each function on the call stack.

The sdb program normally displays the variable in a format 
determined by its type as declared in the source program. To 
request a different format, a specifier is placed after the slash. 
The specifier consists of an optional length specification 
followed by the format. The length specifiers are: 

b    One byte
h    Two bytes (half word)
l    Four bytes (long word).

The lengths are effective only with the formats d, o, x, and u. 
If no length is specified, the word length of the host machine is 
used. A numeric length specifier may be used for the s or a 
commands. These commands normally print characters until 
either a null is reached or 128 characters are printed. The 
number specifies how many characters should be printed. 

There are a number of format specifiers available: 

c    Character.
d    Decimal.
u    Decimal unsigned.
o    Octal.
x    Hexadecimal.
f    32-bit single-precision floating point.
g    64-bit double-precision floating point.
s    Assume variable is a string pointer and print
     characters starting at the address pointed to by
     the variable until a null is reached.
a    Print characters starting at the variable's address
     until a null is reached.
p    Pointer to function.
i    Interpret as a machine-language instruction.

For example, the variable i can be displayed with 

*i/x 

which prints out the value of i in hexadecimal. 

The sdb program also knows about structures, arrays, and 
pointers so that all of the following commands work. 

*array[2][3]/
*sym.id/ 
*psym->usage/ 
*xsym[20].p->usage/ 

The only restriction is that array subscripts must be numbers. 
Depending on your machine, accessing arrays may be limited to 
1-dimensional arrays. Note that as a special case: 

*psym->/d 

displays the location pointed to by psym in decimal. 

Core locations can also be displayed by specifying their absolute 
addresses. The command 

*1024/ 

displays location 1024 in decimal. As in C language, numbers 
may also be specified in octal or hexadecimal so the above 
command is equivalent to both 

*02000/

and

*0x400/

It is possible to mix numbers and variables so that

*1000.x/

refers to an element of a structure starting at address 1000, and

*1000->x/

refers to an element of a structure whose address is at 1000. 
For commands of the type *1000.x/ and *1000->x/, the sdb 
program uses the structure template of the last structure 
referenced. 

The address of a variable is printed with the =, so 
*i= 

displays the address of i. Another feature whose usefulness 
will become apparent later is the command 

*./ 

which redisplays the last variable typed. 






SOURCE FILE DISPLAY AND 
MANIPULATION 

The sdb program has been designed to make it easy to debug a 
program without constant reference to a current source listing. 
Facilities are provided which perform context searches within 
the source files of the program being debugged and display 
selected portions of the source files. The commands are similar 
to those of the UNIX system text editor ed(1). Like the editor,
sdb has a notion of current file and line within the file. The 
sdb program also knows how the lines of a file are partitioned 
into functions, so it also has a notion of current function. As 
noted in other parts of this document, the current function is 
used by a number of sdb commands. 



Displaying the Source File 

Four commands exist for displaying lines in the source file. 
They are useful for perusing the source program and for 
determining the context of the current line. The commands 
are: 



p            Prints the current line.
w            Window; prints a window of ten lines around
             the current line.
z            Prints ten lines starting at the current line.
             Advances the current line by ten.
control-d    Scrolls; prints the next ten lines and advances
             the current line by ten. This command is used
             to cleanly display long segments of the
             program.

When a line from a file is printed, it is preceded by its line 
number. This not only gives an indication of its relative 
position in the file but is also used as input by some sdb 
commands. 

Changing the Current Source File or Function

The e command is used to change the current source file. 
Either of the forms 



*e function 
*e file.c 

may be used. The first causes the file containing the named 
function to become the current file, and the current line 
becomes the first line of the function. The other form causes 
the named file to become current. In this case, the current line 
is set to the first line of the named file. Finally, an e command 
with no argument causes the current function and file named to 
be printed. 



Changing the Current Line in the Source File 

The z and control-d commands have a side effect of changing 
the current line in the source file. The following paragraphs 
describe other commands that change the current line. 



There are two commands for searching for instances of regular 
expressions in source files. They are 

*/regular expression/ 
*?regular expression? 

The first command searches forward through the file for a line 
containing a string that matches the regular expression and the 
second searches backwards. The trailing / and ? may be 
omitted from these commands. Regular expression matching is 
identical to that of ed(1).

The + and - commands may be used to move the current line
forwards or backwards by a specified number of lines. Typing
a new-line advances the current line by one, and typing a
number causes that line to become the current line in the file.
These commands may be combined with the display commands 
so that 

*+15z 

advances the current line by 15 and then prints ten lines. 



A CONTROLLED ENVIRONMENT FOR 
PROGRAM TESTING 

One very useful feature of sdb is breakpoint debugging. After 
entering sdb, certain lines in the source program may be 
specified to be breakpoints. The program is then started with 
an sdb command. Execution of the program proceeds as 
normal until it is about to execute one of the lines at which a 
breakpoint has been set. The program stops and sdb reports 
the breakpoint where the program stopped. Now, sdb 
commands may be used to display the trace of function calls 
and the values of variables. If the user is satisfied the program 
is working correctly to this point, some breakpoints can be 
deleted and others set; then program execution may be 
continued from the point where it stopped. 

A useful alternative to setting breakpoints is single stepping. 
The sdb program can be requested to execute the next line of 
the program and then stop. This feature is especially useful for 
testing new programs, so they can be verified on a statement- 
by-statement basis. If an attempt is made to single step 
through a function which has not been compiled with the -g
option, execution proceeds until a statement in a function
compiled with the -g option is reached. It is also possible to
have the program execute one machine level instruction at a
time. This is particularly useful when the program has not been
compiled with the -g option.

Setting and Deleting Breakpoints

Breakpoints can be set at any line in a function which contains 
executable code. The command format is: 



*12b
*proc:12b
*proc:b
*b



The first form sets a breakpoint at line 12 in the current file. 
The line numbers are relative to the beginning of the file as 
printed by the source file display commands. The second form 
sets a breakpoint at line 12 of function proc, and the third sets 
a breakpoint at the first line of proc. The last sets a 
breakpoint at the current line. 

Breakpoints are deleted similarly with the commands 

*12d 

*proc:12d 

*proc:d 

In addition, if the command d is given alone, the breakpoints 
are deleted interactively. Each breakpoint location is printed, 
and a line is read from the user. If the line begins with a y or 
d, the breakpoint is deleted. 

A list of the current breakpoints is printed in response to a B 
command, and the D command deletes all breakpoints. It is 
sometimes desirable to have sdb automatically perform a 
sequence of commands at a breakpoint and then have execution 
continue. This is achieved with another form of the b 
command. 

*12b t;x/

causes both a trace back and the value of x to be printed each
time execution gets to line 12. The a command is a variation of 
the above command. There are two forms: 

*proc:a 
*proc:12a 

The first prints the function name and its arguments each time 
it is called, and the second prints the source line each time it is 
about to be executed. For both forms of the a command, 
execution continues after the function name or source line is 
printed. 



Running the Program 

The r command is used to begin program execution. It restarts 
the program as if it were invoked from the shell. The 
command 



*r args 



runs the program with the given arguments as if they had been 
typed on the shell command line. If no arguments are 
specified, then the arguments from the last execution of the 
program are used. To run a program with no arguments, use 
the R command. 

After the program is started, execution continues until a 
breakpoint is encountered, a signal such as INTERRUPT or QUIT 
occurs, or the program terminates. In all cases after an 
appropriate message is printed, control returns to sdb. 

The c command may be used to continue execution of a stopped 
program. A line number may be specified, as in: 

*proc:12c

This places a temporary breakpoint at the named line. The
breakpoint is deleted when the c command finishes. There is
also a C command which continues but passes the signal which
stopped the program back to the program. This is useful for 
testing user-written signal handlers. Execution may be 
continued at a specified line with the g command. For 
example: 



*17 g



continues at line 17 of the current function. A use for this 
command is to avoid executing a section of code which is known 
to be bad. The user should not attempt to continue execution in 
a function different than that of the breakpoint. 

The s command is used to run the program for a single line. It 
is useful for slowly executing the program to examine its 
behavior in detail. An important alternative is the S command. 
This command is like the s command but does not stop within 
called functions. It is often used when one is confident that the 
called function works correctly but is interested in testing the 
calling routine. 

The i command is used to run the program one machine level 
instruction at a time while ignoring the signal which stopped 
the program. Its uses are similar to the s command. There is 
also an I command which causes the program to execute one 
machine level instruction at a time, but also passes the signal 
which stopped the program back to the program. 

Calling Functions

It is possible to call any of the functions of the program from 
sdb. This feature is useful both for testing individual functions 
with different arguments and for calling a function which 
prints structured data in a nice way. There are two ways to 
call a function: 



*proc(arg1, arg2, ...)
*proc(arg1, arg2, ...)/m

The first simply executes the function. The second is intended
for calling functions that return a value: it executes the
function and prints the value that it returns. The value is
printed in decimal unless some other format is specified by m.
Arguments to functions
may be integer, character or string constants, or values of 
variables which are accessible from the current function. 

An unfortunate bug in the current implementation is that if a 
function is called when the program is not stopped at a 
breakpoint (such as when a core image is being debugged) all 
variables are initialized before the function is started. This 
makes it impossible to use a function which formats data from 
a dump. 



MACHINE LANGUAGE DEBUGGING 

The sdb program has facilities for examining programs at the 
machine language level. It is possible to print the machine 
language statements associated with a line in the source and to 
place breakpoints at arbitrary addresses. The sdb program can 
also be used to display or modify the contents of the machine 
registers. 

Displaying Machine Language Statements

To display the machine language statements associated with 
line 25 in function main, use the command 



*main:25? 

The ? command is identical to the / command except that it 
displays from text space. The default format for printing text 
space is the i format which interprets the machine language 
instruction. The control-d command may be used to print the 
next ten instructions. 

Absolute addresses may be specified instead of line numbers by 
appending a : to them so that 

*0x1024:?

displays the contents of address 0x1024 in text space. Note that
the command

*0x1024?

displays the instruction corresponding to line 0x1024 in the
current function. It is also possible to set or delete a
breakpoint by specifying its absolute address:

*0x1024:b

sets a breakpoint at address 0x1024.

Manipulating Registers

The x command prints the values of all the registers. Also,
individual registers may be named instead of variables by 
appending a % to their name so that 

*r3% 

displays the value of register r3.

OTHER COMMANDS 

To exit sdb, use the q command. 

The ! command is identical to that in ed(1) and is used to have
the shell execute a command. 

It is possible to change the values of variables when the 
program is stopped at a breakpoint. This is done with the 
command 

*variable!value 

which sets the variable to the given value. The value may be a 
number, character constant, register, or the name of another 
variable. If the variable is of type float or double, the value can 
also be a floating-point constant. 

Figure 7-1. Example of sdb usage

$ cat testdiv2.c
main(argc, argv, envp)
char **argv, **envp; {
        int i;
        i = div2(-1);
        printf("-1/2 = %d\n",i);
}
div2(i) {
        int j;
        j = i>>1;
        return(j);
}
$ cc -g testdiv2.c
$ a.out
-1/2 = -1
$ sdb
No core image                   # Warning message from sdb
*/^div2                         # Search for function "^div2"
7: div2(i) {                    # It starts on line 7
*z                              # Print the next few lines
7: div2(i) {
8:         int j;
9:         j = i>>1;
10:         return(j);
*div2:b                         # Place breakpoint at beginning of "div2"
div2:9 b                        # Sdb echoes proc name and line number
*r                              # Run the function
a.out                           # Sdb echoes command line executed
Breakpoint at                   # Execution stops just before line 9
div2:9: j = i>>1;
*t                              # Print trace of subroutine calls
div2(i=-1) [testdiv2.c:9]
main(argc=1,argv=0x7fffff50,envp=0x7fffff58) [testdiv2.c:4]
*i/                             # Print i
-1
*s                              # Single step
div2:10: return(j);             # Execution stops before line 10
*j/                             # Print j
-1

INTRODUCTION

This is a reference manual for MAS, the UNIX System
assembler for the Motorola 68010 [for historical reasons as(1)
and mas(1) are synonymous]. Programmers familiar with the
MC68010 should be able to program in MAS referring to this 
manual, but this is not a manual for the processor itself. 
Details about the effects of instructions, meaning of status 
register bits, handling of interrupts, and many other issues are 
not dealt with here. This manual, therefore, should be used in 
conjunction with the Motorola publication, MC68010 16-Bit 
Virtual Memory Microprocessor Manual. 



Warnings 

A few important warnings to the MAS user should be 
emphasized at the outset. Though for the most part there is a 
direct correspondence between MAS notation and the notation 
used in the MC68010 User's Manual, the following exceptions 
could lead the unsuspecting user to write incorrect code. 



Comparison Instructions 

First, the order of the operands in compare instructions follows
one convention in the MC68010 User's Manual and the opposite
convention in MAS. Using the convention of the MC68010 User's
Manual one might write

CMPW    D5,D3           Is (D3-D5) less than or
                        equal to zero?
BLE     IS_LESS         Branch if yes.

Using the MAS convention one would write rather

cmp.w   %d3,%d5         # Is (d3-d5) less than or
                        # equal to zero?
ble     is_less         # Branch if yes.

MAS follows the convention used by other assemblers 
supported in the UNIX System (both the 3B20S and the VAX 
follow this convention). This convention makes for 
straightforward reading of compare-and-branch instruction 
sequences, but does nonetheless lead to the peculiarity that if a 
compare instruction is replaced by a subtract instruction, the 
effect on the condition codes will be entirely different. This 
may be confusing to programmers who are used to thinking of 
a comparison as a subtraction whose result is not stored. But 
users of MAS who become accustomed to the convention will 
find that both the compare and subtract notations make sense 
in their respective contexts. 



Overloading of Opcodes 

Another issue that users must be aware of arises from the 
MC68010's use of several different instructions to do more or 
less the same thing. For example, the MC68010 User's Manual 
lists the instructions SUB, SUBA, SUBI, and SUBQ, which all 
have the effect of subtracting their source operand from their 
destination operand. MAS provides the convenience of allowing 
all these operations to be specified by a single assembly 
instruction sub. On the basis of the operands given to the sub 
instruction, the MAS assembler selects the appropriate 
MC68010 operation code. 

The danger created by this convenience is that it could leave 
the misleading impression that all forms of the SUB operation 
are semantically identical. In fact, they are not. The careful 
reader of the MC68010 User's Manual will notice that whereas 
SUB, SUBI, and SUBQ all affect the condition codes in a 
consistent way, SUBA does not affect the condition codes at all. 
Consequently, the MAS user must be aware that when the
destination of a sub instruction is an address register (which 
causes the sub to be mapped into the operation code for SUBA), 
the condition codes will not be affected. 



USE OF THE ASSEMBLER

The UNIX System command mas invokes the assembler and 
has the following syntax: 



mas [ -o output ] file

or

as [ -o output ] file

This causes the named file to be assembled. The output of the
assembly is left on the file output specified with the -o flag. If
no such specification is made, the output is left in the file
whose name is formed by removing the .s suffix, if there is one,
from the input file name and appending a .o suffix.

Format of Assembly Language Line

Typical lines of MAS assembly code look like these: 

# Clear a block of memory at location %a3

        text    2
        mov.w   &const,%d1
loop:   clr.l   (%a3)+
        dbf     %d1,loop        # go back for const
                                # repetitions
init2:  clr.l   count; clr.l credit;
        clr.l   debit;

These general points about the example should be noted: 



- An identifier occurring at the beginning of a line and
  followed by a colon (:) is a label. One or more labels may
  precede any assembly language instruction or
  pseudo-operation. See also Location Counters and Labels
  which follows.

- A line of assembly code need not include an instruction. It
  may consist of a comment alone (introduced by #), a label
  alone (terminated by :), or it may be entirely blank.

- It is good practice to use tabs to align assembly language
  operations and their operands into columns, but this is not
  a requirement of the assembler. An opcode may appear at
  the beginning of the line, if desired, and spaces may
  precede a label. A single blank or tab suffices to separate
  an opcode from its operands. Additional blanks and tabs
  are ignored by the assembler.

- It is permissible to write several instructions on one line by
  separating them with semicolons. The semicolon is
  syntactically equivalent to a newline, but a semicolon
  inside a comment is ignored.



Comments 

Comments are introduced by the character # and continue to 
the end of the line. Comments may appear anywhere and are 
completely disregarded by the assembler. 



Identifiers 

An identifier is a string of characters taken from the set a-z, 
A-Z, _, ~, %, and 0-9. The first character of an identifier must
be a letter (upper or lower case) or an underscore. Upper and 
lower case letters are distinguished; 

con35 and CON35 

are two distinct identifiers. 

There is no limit on the length of an identifier. 

The value of an identifier is established by the set pseudo- 
operation (see Symbol Counter Control Operations) or by using 
it as a label (see Location Counters and Labels). 

The character ~ has special significance to the assembler. A ~
used alone, as an identifier, means "the current location." A ~
used as the first character in an identifier becomes a "." in the
symbol table, allowing symbols such as .eos and .0fake to make
it into the symbol table, as required by the Common Object File
Format.

Register Identifiers

A register identifier is an identifier preceded by the character 
%, and represents one of the MC68010 processor's registers. 
The predefined register identifiers are:

%d0     %d4     %a0     %a4     %cc     %usp
%d1     %d5     %a1     %a5     %pc     %fp
%d2     %d6     %a2     %a6     %sp
%d3     %d7     %a3     %a7     %sr





Note: The identifiers %a7 and %sp represent one and 
the same machine register. Likewise, %a6 and %fp are 
equivalent. Use of both %a7 and %sp, or %a6 and %fp, 
in the same program may result in confusion. 



Constants 

MAS deals only with integer constants. They may be entered 
in decimal, octal, or hexadecimal, or they may be entered as 
character constants. Internally, MAS treats all constants as 
32-bit binary two's complement quantities. 



Numerical Constants 

A decimal constant is a string of digits beginning with a non- 
zero digit. 

An octal constant is a string of digits beginning with zero. 

A hexadecimal constant consists of the characters 0x or 0X
followed by a string of characters from the set 0-9, a-f, and
A-F. In hexadecimal constants, upper and lower case letters are
not distinguished.

Examples:

set     const,35        # Decimal 35
mov.w   &035,%d1        # Octal 35 (decimal 29)
set     const,0x35      # Hex 35 (decimal 53)
mov.w   &0xff,%d1       # Hex ff (decimal 255)



Character Constants 

An ordinary character constant consists of a single-quote (')
followed by an arbitrary ASCII character other than \. The
value of the constant is equal to the ASCII code for the
character. Special meanings of characters are overridden when
they are used in character constants; for example, a # in a
character constant does not introduce a comment.

A special character constant consists of '\ followed by another 
character. All the special constants, and examples of ordinary 
character constants, are listed here: 



Constant    Value    Meaning

'\b         0x08     Backspace
'\t         0x09     Horizontal Tab
'\n         0x0a     Newline (Line Feed)
'\v         0x0b     Vertical Tab
'\f         0x0c     Form Feed
'\r         0x0d     Carriage Return
'\\         0x5c     Backslash
'\'         0x27     Single-Quote
'0          0x30     Zero
'A          0x41     Capital A
'a          0x61     Lower Case A

Other Syntactic Details

A discussion of expression syntax appears in EXPRESSIONS. 
Information about the syntax of specific components of MAS 
instructions and pseudo-operations is given later in the sections 
entitled PSEUDO-OPERATIONS, SPAN-DEPENDENT 
OPTIMIZATION, and ADDRESS MODE SYNTAX. 



SEGMENTS, LOCATION COUNTERS, AND
LABELS



Segments 

A program in MAS assembly language may be broken into 
segments known as text, data, and bss segments. The 
convention regarding the use of these segments is to place 
instructions in text segments, initialized data in data segments, 
and uninitialized data in bss segments. However, the assembler 
does not enforce this convention; for example, it permits 
intermixing of instructions and data in a text segment. 

Primarily to simplify compiler code generation, the assembler 
permits up to four separate text segments and four separate 
data segments named 0, 1, 2, and 3. The assembly language 
program may switch freely between them by using assembler 
pseudo-operations. (See the section entitled Location Counter 
Control Operations.) When generating the object file, the 
assembler concatenates the text segments to generate a single 
text segment, and the data segments to generate a single data 
segment. Thus, the object file contains only one text segment 
and only one data segment. 

There is only one bss segment to begin with, and it maps 
directly into the object file. 

Because the assembler keeps together everything from a given 
segment when generating the object file, the order in which 
information appears in the object file may not be the same as 
in the assembly language file. For example, if the data for a
program consisted of

        data    1               # segment 1
        word    0x1111
        data    0               # segment 0
        long    0xffffffff
        data    1               # segment 1
        word    0x2222

then equivalent object code would be generated by

        data    0
        long    0xffffffff
        word    0x1111
        word    0x2222



Location Counters and Labels 

The assembler maintains separate location counters for the bss 
segment and for each of the text and data segments. The 
location counter for a given segment is incremented by one for 
each byte generated in that segment. 

The location counters allow values to be assigned to labels. 
When an identifier is used as a label in the assembly language 
input, the current value of the current location counter is 
assigned to the identifier. The assembler also keeps track of 
which segment the label appeared in. Thus, the identifier 
represents a memory location relative to the beginning of a 
particular segment. 

TYPES

Identifiers and expressions may have values of different types: 

- In the simplest case, an expression (or identifier) may have
  an absolute value, such as 29, -5000, or 262143.

- An expression (or identifier) may have a value relative to
  the start of a particular segment. Such a value is known as
  a relocatable value. The memory location represented by
  such an expression cannot be known at assembly time, but
  the relative values (i.e., the difference) of two such
  expressions can be known if they refer to the same
  segment.

  Identifiers which appear as labels have relocatable values.

- If an identifier is never assigned a value, it is assumed to
  be an undefined external. Such identifiers may be used
  with the expectation that their values will be defined in
  another program, and hence known at load time; but the
  relative values of undefined externals cannot be known.



EXPRESSIONS 

For conciseness, the following abbreviations will be useful: 

abs absolute expression 
rel relocatable expression 
ext undefined external 

All constants are absolute expressions. An identifier may be 
thought of as an expression having the identifier's type. 
Expressions may be built up from lesser expressions using the 
operators +, -, *, and / according to the following type rules:

abs + abs = abs
abs + rel = rel + abs = rel
abs + ext = ext + abs = ext

abs - abs = abs
rel - abs = rel
ext - abs = ext
rel - rel = abs,  provided that the two relocatable
                  expressions are relative to the
                  same segment.

abs * abs = abs
abs / abs = abs
- abs = abs



Note: Use of a rel-rel expression is dangerous, 
particularly when dealing with identifiers from text- 
segments. The problem is that the assembler will 
determine the value of the expression before it has 
resolved all questions concerning span-dependent 
optimizations. Use this feature at your own risk! 



The unary minus operator takes the highest precedence; the 
next highest precedence is given to * and /, and lowest 
precedence is given to + and binary -. Parentheses may be 
used to coerce the order of evaluation. 

If the result of a division is a positive non-integer, it will be 
truncated towards zero. If the result is a negative non-integer, 
the direction of truncation cannot be guaranteed. 






UNIX SYSTEM ASSEMBLER FOR UNIX PC 
PSEUDO-OPERATIONS 

Data Initialization Operations 

byte abs, abs,... 

One or more arguments, separated by 
commas, may be given. The values of the 
arguments are computed to produce 
successive bytes in the assembly output. 



short abs, abs,...

One or more arguments, separated by
commas, may be given. The values of the
arguments are computed to produce
successive 16-bit words in the assembly
output.

long expr, expr,...

One or more arguments, separated by
commas, may be given. Each expression
may be absolute, relocatable, or undefined
external. A 32-bit quantity is generated for
each such argument (in the case of
relocatable or undefined external
expressions, the actual value may not be
filled in until load time).

Alternatively, the arguments may be bit- 
field expressions. A bit-field expression has 
the form 



n : value 



where both n and value denote absolute
expressions. The quantity n represents a
field width; the low-order n bits of value
become the contents of the bit-field.
Successive bit-fields fill up 32-bit long
quantities starting with the high-order part. 
If the sum of the lengths of the bit-fields is 
less than 32 bits, the assembler creates a 
32-bit long with zeros filling out the low- 
order bits. For example, 

long 4:-1, 16:0x7f, 12:0, 5000 

and 
long 4:-1 , 16: 0x7f , 5000 

are equivalent to 

long OxfOOVfOOO, 5000 

Bit-fields may not span pairs of 32-bit longs. 
Thus, 

long 24:0xa, 24:0xb, 24:0xc 

yields the same thing as 

long OxOOOOOaOO, OxOOOOObOO, 
OxOOOOOcOO 
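
The packing rule for bit-fields can be checked with a short C sketch. It is a model of the behavior described above, not the assembler's implementation; struct bitfield and pack_bitfields are hypothetical names, and ordinary (non-bit-field) arguments such as the 5000 in the examples are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "n : value" bit-field argument. */
struct bitfield {
    unsigned width;     /* n: field width in bits, 1..32 */
    int32_t  value;     /* only the low-order n bits are used */
};

/* Pack fields into 32-bit longs, high-order part first. A field that
 * does not fit in the bits left over starts a new long. The caller
 * must provide room in out for one long per field. Returns the number
 * of longs produced. */
size_t pack_bitfields(const struct bitfield *f, size_t nfields, uint32_t *out)
{
    size_t nlongs = 0;
    unsigned used = 32;                 /* forces a new long for field 0 */

    for (size_t i = 0; i < nfields; i++) {
        unsigned w = f[i].width;
        assert(w >= 1 && w <= 32);
        if (used + w > 32) {            /* fields may not span two longs */
            out[nlongs++] = 0;
            used = 0;
        }
        uint32_t mask = (w == 32) ? 0xffffffffu : ((1u << w) - 1u);
        uint32_t bits = (uint32_t)f[i].value & mask;
        out[nlongs - 1] |= bits << (32 - used - w);
        used += w;
    }
    return nlongs;
}
```

Packing the fields 4:-1, 16:0x7f, 12:0 with this function gives the single long 0xf007f000, and packing three 24-bit fields gives three separate longs, matching the two examples in the text.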

space abs

The value of abs is computed, and the
resultant number of bytes of zero data is
generated. For example,

        space 6

is equivalent to

        byte 0, 0, 0, 0, 0, 0



Symbol Counter Control Operations 

set identifier, expr 

The value of identifier is set equal to expr, 
which may be absolute or relocatable. 

comm identifier, abs 

The named identifier is to be assigned to a 
common area of size abs bytes. If identifier 
is not defined by another program, the 
loader will allocate space for it. 

The type of identifier becomes undefined 
external. 

lcomm identifier, abs

The named identifier is assigned to a local
common of size abs bytes. This results in
allocation of space in the bss segment.

The type of identifier becomes relocatable.

global identifier

This causes identifier to be externally
visible. If identifier is defined in the
current program, then declaring it global
allows the loader to resolve references to
identifier in other programs.

If identifier is not defined in the current
program, the assembler expects an external
resolution; in this case, therefore, identifier
is global by default.




Location Counter Control Operations 

data abs

The argument, if present, must evaluate to
0, 1, 2, or 3; this indicates the number of the
data segment into which assembly is to be
directed. If no argument is present,
assembly is directed into data segment 0.

text abs

The argument, if present, must evaluate to
0, 1, 2, or 3; this indicates the number of the
text segment into which assembly is to be
directed. If no argument is present,
assembly is directed into text segment 0.

Before the first data or text operation is
encountered, assembly is by default directed
into text segment 0.

org expr

The current location counter is set to expr.
Expr must represent a value in the current
segment, and must not be less than the
current location counter.

even

The current location counter is rounded up
to the next even value.
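
The two location-counter adjustments described above are simple arithmetic. The C sketch below illustrates them under stated assumptions: the location counter is modeled as an unsigned long, and the function names lc_round_even and lc_org are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round a location counter up to the next even value
 * (an even value is left unchanged). */
unsigned long lc_round_even(unsigned long lc)
{
    return (lc + 1UL) & ~1UL;
}

/* Model of org: move the counter forward to target. The counter may
 * not move backward, so a smaller target is rejected.
 * Returns 1 on success, 0 if target is behind the current counter. */
int lc_org(unsigned long *lc, unsigned long target)
{
    assert(lc != NULL);
    if (target < *lc)
        return 0;
    *lc = target;
    return 1;
}
```

With a counter of 5, lc_round_even returns 6; with a counter of 6 it returns 6 unchanged.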



Symbolic Debugging Operations 

The assembler allows for symbolic debugging information to be
placed into the object code file with special pseudo-operations.
The information typically includes line numbers and
information about C language symbols, such as their type and
storage class. The Motorola 68010 SGS C compiler generates
symbolic debugging information when the -g option is used.
Assembler programmers may also include such information in
source files.



file and ln

The file pseudo-operation passes the name of the source file
into the object file symbol table. It has the form

        file "filename"

where filename consists of one to 14 characters.

The ln pseudo-operation makes a line number table entry in the
object file. That is, it associates a line number with a memory
location. Usually the memory location is the current location in
text. The format is

        ln line [, value]

where line is the line number. The optional value is the
address in text, data, or bss to associate with the line number.
The default when value is omitted (which is usually the case) is
the current location in text.

Symbol Attribute Operations 

The basic symbolic testing pseudo-operations are def and endef. 
These operations enclose other pseudo-operations that assign 
attributes to a symbol and must be paired. 



        def name
        .               # Attribute
        .               # Assigning
        .               # Operations
        endef

Note 1: def does not define the symbol, although it
does create a symbol table entry. Because an undefined
symbol is treated as external, a symbol which appears in
a def, but which never acquires a value, will ultimately
result in an error at link edit time.

Note 2: To allow the assembler to calculate the sizes of
functions for other SGS tools, each def/endef pair that
defines a function name must be matched by a def/endef
pair after the function in which a storage class of -1 is
assigned.



The paragraphs below describe the attribute-assigning 
operations. Keep in mind that all of these operations apply to 
the symbol name which appeared in the opening def pseudo- 
operation. 



val expr

Assigns the value expr to name. The type of
the expression expr determines with which
section name is associated. If value is -, the
current location in the text section is used.

scl expr

Declares a storage class for name. The
expression expr must yield an ABSOLUTE
value that corresponds to the C compiler's
internal representation of a storage class.
The special value -1 designates the physical
end of a function.

type expr

Declares the C language type of name. The
expression expr must yield an ABSOLUTE
value that corresponds to the C compiler's
internal representation of a basic or derived
type.

tag str

Associates name with the structure,
enumeration, or union named str, which
must have already been declared with a
def/endef pair.

line expr

Provides the line number of name, where
name is a block symbol. The expression expr
should yield an ABSOLUTE value that
represents a line number.

size expr

Gives a size for name. The expression expr
must yield an ABSOLUTE value. When
name is a structure or an array with a
predetermined extent, expr gives the size in
bytes. For bit fields, the size is in bits.

dim expr1, expr2,...

Indicates that name is an array. Each of
the expressions must yield an ABSOLUTE
value that provides the corresponding array
dimension.



Switch Table Operation 

The MC68010 SGS C compiler generates a compact set of 
instructions for the C language switch construct, of which an 
example is shown below. 

                sub.l   &1,%d0
                cmp.l   %d0,&4
                bhi     L%21
                add.w   %d0,%d0
                mov.w   10(%pc,%d0.w),%d0
                jmp     6(%pc,%d0.w)
                swbeg   &5
        L%22:
                short   L%15-L%22
                short   L%21-L%22
                short   L%16-L%22
                short   L%21-L%22
                short   L%17-L%22



The special swbeg pseudo-operation communicates to the
assembler that the lines following it contain rel-rel
subtractions. Remember that ordinarily such subtractions are
risky because of span-dependent optimization. In this case,
however, the assembler makes special allowances for the
subtraction because the compiler guarantees that both symbols
will be defined in the current assembler file and that one of the
symbols is a fixed distance away from the current location.

The swbeg pseudo-operation takes an argument that looks like 
an immediate operand. The argument is the number of lines 
that follow swbeg and that contain switch table entries. Swbeg 
inserts two words into text. The first is the ILLEGAL 
instruction code. The second is the number of table entries that 
follow. The Motorola 68010 SGS disassembler needs the 
ILLEGAL instruction as a hint that what follows is a switch 
table. Otherwise it would get confused when it tried to decode 
the table entries, differences between two symbols, as 
instructions. 
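
The two words that swbeg places in text, and the way a disassembler might use them, can be sketched in C. This is an illustration only: emit_swbeg and is_switch_table are hypothetical names, and the sketch assumes the standard MC68000-family ILLEGAL opcode, 0x4afc.

```c
#include <assert.h>
#include <stdint.h>

#define M68K_ILLEGAL 0x4afcu    /* MC68000-family ILLEGAL opcode */

/* Write the two words swbeg inserts: the ILLEGAL instruction code,
 * then the number of switch table entries that follow.
 * Returns the number of 16-bit words written. */
int emit_swbeg(uint16_t *text, unsigned nentries)
{
    assert(nentries <= 0xffffu);
    text[0] = M68K_ILLEGAL;
    text[1] = (uint16_t)nentries;
    return 2;
}

/* Disassembler-side check: if a switch table marker starts here,
 * store the entry count and return 1; otherwise return 0. */
int is_switch_table(const uint16_t *text, unsigned *nentries)
{
    if (text[0] != M68K_ILLEGAL)
        return 0;
    *nentries = text[1];
    return 1;
}
```

For the example above, emit_swbeg(text, 5) would be followed by the five short table entries, which a disassembler then skips instead of decoding them as instructions.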

SPAN-DEPENDENT OPTIMIZATION

The assembler makes certain choices about the object code it 
generates based on the distance between an instruction and its 
operand(s). Choosing the smallest, fastest form is called span- 
dependent optimization. Span-dependent optimization occurs 
most obviously in the choice of object code for branches and 
jumps. It also occurs when an operand may be represented by 
the program counter relative address mode instead of as an 
absolute 2-word (long) address. The span-dependent 
optimization capability is normally enabled; the -n command 
line flag disables it. When this capability is disabled, the 
assembler makes worst-case assumptions about the types of 
object code that must be generated. 

In the MC68010 Software Generation System, the compiler 
generates branch instructions without a specific offset size. 
When the optimizer is used, it identifies branches which could 
be represented by the short form, and it changes the operation 
accordingly. The assembler chooses only between long 
and very-long representations for branches. 

Branch instructions, such as bra, bsr, bgt, and so on, can have 
either a byte or a word pc-relative address operand. A byte 
size specification should be used only when the user is sure that 
the address intended can be represented in the byte allowed. 
The assembler will take one of these instructions with 
a byte size specification and generate the byte form of 
the instruction without asking questions. 

Although the largest offset specification allowed is a word, 
large programs could conceivably have need for a branch to a 
location not reachable by a word displacement. Therefore, 
equivalent long forms of these instructions might be needed. 
When the assembler encounters a branch instruction without a 
size specification, or with a word size specification, it tries to 
choose between the long and very-long forms of the instruction. 
If the operand can be represented in a word, then the word 
form of the instruction will be generated. Otherwise the very- 
long form will be generated. For unconditional branches, e.g., 
br, bra and bsr, the very-long form is just the equivalent jump
(jmp and jsr) with an absolute address operand (instead of pc- 
relative). For conditional branches, the equivalent very-long 
form is a conditional branch around a jump, where the 
conditional test has been reversed. 
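
The choice the assembler makes for a branch can be summarized in a few lines of C. This sketch only restates the rules in the preceding paragraphs; branch_form and choose_branch_form are hypothetical names, and the displacement is assumed to be already known as a signed distance to the target.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { FORM_BYTE, FORM_WORD, FORM_VERY_LONG } branch_form;

/* explicit_byte: nonzero if the source gave a byte size specification.
 * disp: signed distance from the branch to its target. */
branch_form choose_branch_form(int explicit_byte, long disp)
{
    if (explicit_byte)
        return FORM_BYTE;       /* generated as written, no range check */
    if (disp >= INT16_MIN && disp <= INT16_MAX)
        return FORM_WORD;       /* operand fits in a word displacement */
    return FORM_VERY_LONG;      /* jmp/jsr, or reversed branch around jmp */
}

/* Usage sketch: nearby targets get the word form, distant ones do not. */
void branch_form_examples(void)
{
    assert(choose_branch_form(0, 100) == FORM_WORD);
    assert(choose_branch_form(0, 40000L) == FORM_VERY_LONG);
    assert(choose_branch_form(1, 100) == FORM_BYTE);
}
```

Note that the byte form is never chosen automatically here, which mirrors the statement that the assembler itself chooses only between the long and very-long forms.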

The following table summarizes span-dependent optimizations. 
The assembler chooses only between the long form and very- 
long form, while the optimizer chooses between the short and 
long form for branches (but not bsr). 



            Assembler Span-Dependent Optimizations

Instruction     Short Form     Long Form        Very Long Form

br, bra, bsr    byte offset    word offset      jmp or jsr with
                                                absolute long
                                                address

conditional     byte offset    word offset      short conditional
branch                                          branch with
                                                reversed condition
                                                around jmp with
                                                absolute long
                                                address

jmp, jsr        -              pc-relative      absolute long
                               address          address

lea, pea        -              pc-relative      absolute long
                               address          address

ADDRESS MODE SYNTAX

The following table summarizes the MAS syntax for MC68010
addressing modes.

In the table, the letter n, as in An or Dn, an or dn, represents
any digit from 0 to 7. The notations Ri and ri represent any of
the MC68010 data or address registers.

The letter d, where it is used to represent a displacement, may 
stand for any absolute expression. 

It is important to note that expressions used for the Absolute 
addressing modes need not be absolute expressions in the sense 
defined in TYPES. Although the addresses used in those 
addressing modes must ultimately be filled in with constants, 
that can be done by the loader— there is no need for the 
assembler to be able to compute them. Indeed, the Absolute 
Long addressing mode is commonly used for accessing 
undefined external addresses. 

                  Effective Address Modes

Motorola        MAS             Effective Address Mode
Notation        Notation

Dn              %dn             Data Register Direct

An              %an             Address Register Direct

(An)            (%an)           Address Register Indirect

An@+            (%an)+          Address Register Indirect
                                with Postincrement

An@-            -(%an)          Address Register Indirect
                                with Predecrement

An@(d)          d(%an)          Address Register Indirect
                                with Displacement
                                (d signifies a signed 16-bit
                                absolute displacement)

An@(d,Ri.W)     d(%an,%ri.w)    Address Register
An@(d,Ri.L)     d(%an,%ri.l)    Indirect with Index
                                (d signifies a signed
                                8-bit absolute
                                displacement)

xxx.W           xxx             Absolute Short Address
                                (xxx signifies an expression
                                yielding a signed 16-bit
                                memory address)

xxx.L           xxx             Absolute Long Address
                                (xxx signifies an expression
                                yielding a 32-bit memory
                                address)

PC@(d)          d(%pc)          Program Counter with
                                Displacement
                                (d signifies a signed 16-bit
                                absolute displacement)

PC@(d,Ri.W)     d(%pc,%rn.w)    Program Counter with Index
PC@(d,Ri.L)     d(%pc,%rn.l)    (d signifies a signed 8-bit
                                absolute displacement)

#xxx            &xxx            Immediate Data
                                (xxx signifies an absolute
                                constant expression)

MACHINE INSTRUCTIONS

The following table shows how MC68010 instructions should be 
written in order to be understood correctly by the MAS 
assembler. Several abbreviations are used in the table: 



S       The letter S, as in add.S, stands for one of the operation
        size attribute letters b, w, or l, representing a byte, word,
        or long operation.

A       The letter A, as in add.A, stands for one of the address
        operation size attribute letters w or l, representing a word
        or long operation.

CC      In the context bCC, dbCC, and sCC, the letters CC
        represent any of the following condition code designations
        (except that f and t may not be used in the bCC
        instruction):

                cc   carry clear            ls   low or same
                cs   carry set              lt   less than
                eq   equal                  mi   minus
                f    false                  ne   not equal
                ge   greater or equal       pl   plus
                gt   greater than           t    true
                hi   high                   vc   overflow clear
                hs   high or same (=cc)     vs   overflow set
                le   less or equal
                lo   low (=cs)

EA      This represents an arbitrary effective address.

I       An absolute expression, used as an immediate operand.

Q       An absolute expression evaluating to a number from 1 to 8.

L       A label reference, or any expression representing a
        memory address in the current segment.

%dx, %dy, %dn, %ax, %ay, and %an represent registers.

                  MC68010 Instruction Formats

Operation   MAS Syntax                Meaning

ABCD        abcd.b    %dy,%dx         Add Decimal with
                      -(%ay),-(%ax)   Extend

ADD         add.S     EA,%dn          Add Binary
                      %dn,EA

ADDA        add.A     EA,%an          Add Address

ADDI        add.S     &I,EA           Add Immediate

ADDQ        add.S     &Q,EA           Add Quick

ADDX        addx.S    %dy,%dx         Add Extended
                      -(%ay),-(%ax)

AND         and.S     EA,%dn          AND Logical
                      %dn,EA

ANDI        and.S     &I,EA           AND Immediate

ANDI        and.b     &I,%cc          AND Immediate
to CCR                                to Condition Codes

ANDI        and.w     &I,%sr          AND Immediate
to SR                                 to the Status Register

ASL         asl.S     %dx,%dy         Arithmetic Shift (Left)
                      &Q,%dy
            asl.w     &1,EA

ASR         asr.S     %dx,%dy         Arithmetic Shift (Right)
                      &Q,%dy
            asr.w     &1,EA

Bcc         bCC       L               Branch Conditionally
                                      (16-bit Displacement)
            bCC.b     L               Branch Conditionally
                                      (Short)
                                      (8-bit Displacement)

BCHG        bchg      %dn,EA          Test a Bit and Change
                      &I,EA           Note: bchg should be
                                      written with no suffix.
                                      If the second operand is
                                      a data register, .l is
                                      assumed; otherwise .b is.

BCLR        bclr      %dn,EA          Test a Bit and Clear
                      &I,EA           Note: bclr should be
                                      written with no suffix.
                                      If the second operand
                                      is a data register, .l is
                                      assumed; otherwise .b is.

BRA         bra       L               Branch Always
                                      (16-bit Displacement)
            bra.b     L               Branch Always (Short)
                                      (8-bit Displacement)
            br        L               Same as bra
            br.b      L               Same as bra.b

BSET        bset      %dn,EA          Test a Bit and Set
                      &I,EA           Note: bset should be
                                      written with no suffix.
                                      If the second operand is a
                                      data register, .l is
                                      assumed; otherwise .b is.

BSR         bsr       L               Branch to Subroutine
                                      (16-bit Displacement)
            bsr.b     L               Branch to Subroutine
                                      (Short)
                                      (8-bit Displacement)

BTST        btst      %dn,EA          Test a Bit
                      &I,EA           Note: btst should be
                                      written with no suffix.
                                      If the second operand is a
                                      data register, .l is
                                      assumed; otherwise .b is.

CHK         chk.w     EA,%dn          Check Register Against
                                      Bounds

CLR         clr.S     EA              Clear an Operand

CMP         cmp.S     %dn,EA          Compare

CMPA        cmp.A     %an,EA          Compare Address

CMPI        cmp.S     EA,&I           Compare Immediate

CMPM        cmp.S     (%ax)+,(%ay)+   Compare Memory
                                      Note: Order of operands
                                      in MAS is reverse of
                                      that in MC68010 User's
                                      Manual

DBcc        dbCC      %dn,L           Test Condition,
                                      Decrement, and Branch
            dbra      %dn,L           Decrement and Branch
                                      Always
            dbr       %dn,L           Same as dbra

DIVS        divs.w    EA,%dn          Signed Divide

DIVU        divu.w    EA,%dn          Unsigned Divide

EOR         eor.S     %dn,EA          Exclusive OR Logical

EORI        eor.S     &I,EA           Exclusive OR Immediate

EORI        eor.b     &I,%cc          Exclusive OR Immediate
to CCR                                to Condition Codes

EORI        eor.w     &I,%sr          Exclusive OR Immediate
to SR                                 to the Status Register

EXG         exg       %rx,%ry         Exchange Registers

EXT         ext.A     %dn             Sign Extend

JMP         jmp       EA              Jump

JSR         jsr       EA              Jump to Subroutine

LEA         lea.l     EA,%an          Load Effective Address

LINK        link      %an,&I          Link and Allocate

LSL         lsl.S     %dx,%dy         Logical Shift (Left)
                      &Q,%dy
            lsl.w     &1,EA

LSR         lsr.S     %dx,%dy         Logical Shift (Right)
                      &Q,%dy
            lsr.w     &1,EA

MOVE        mov.S     EA,EA           Move Data from Source
                                      to Destination
                                      Note: If the destination
                                      is an address register,
                                      the instruction generated
                                      is MOVEA.

MOVE        mov.w     EA,%cc          Move to Condition Codes
to CCR

MOVE        mov.w     %cc,EA          Move from Condition Codes
from CCR

MOVE        mov.w     EA,%sr          Move to Status Register
to SR

MOVE        mov.w     %sr,EA          Move from Status Register
from SR

MOVE        mov.l     %usp,%an        Move User Stack Pointer
USP                   %an,%usp

MOVEA       mov.A     EA,%an          Move Address

MOVEC                                 Move Control Register

MOVEM       movm.A    &I,EA           Move Multiple Registers
                      EA,&I           Note: Immediate operand
                                      is a mask designating
                                      which registers are to
                                      be moved to memory or
                                      which registers are to
                                      receive memory data.
                                      Not all addressing modes
                                      are permitted, and the
                                      correspondence between
                                      mask bits and register
                                      numbers depends on the
                                      addressing mode used.
                                      See MC68010 User's Manual
                                      for details.

MOVEQ       mov.l     &I,%dn          Move Quick (when I fits
                                      in byte)

MOVES       movs.S    EA,EA           Move Alternate Address
                                      Space

MULS        muls.w    EA,%dn          Signed Multiply

MULU        mulu.w    EA,%dn          Unsigned Multiply

NBCD        nbcd.b    EA              Negate Decimal
                                      with Extend

NEG         neg.S     EA              Negate

NEGX        negx.S    EA              Negate with Extend

NOP         nop                       No Operation

NOT         not.S     EA              Logical Complement

OR          or.S      EA,%dn          Inclusive OR Logical
                      %dn,EA

ORI         or.S      &I,EA           Inclusive OR Immediate

ORI         or.b      &I,%cc          Inclusive OR Immediate
to CCR                                to Condition Codes

ORI         or.w      &I,%sr          Inclusive OR Immediate
to SR                                 to the Status Register

PEA         pea       EA              Push Effective Address

RESET       reset                     Reset External Devices

ROL         rol.S     %dx,%dy         Rotate
                      &Q,%dy          (without Extend) (Left)
            rol.w     &1,EA

ROR         ror.S     %dx,%dy         Rotate
                      &Q,%dy          (without Extend) (Right)
            ror.w     &1,EA

ROXL        roxl.S    %dx,%dy         Rotate with Extend (Left)
                      &Q,%dy
            roxl.w    &1,EA

ROXR        roxr.S    %dx,%dy         Rotate with Extend (Right)
                      &Q,%dy
            roxr.w    &1,EA

RTE         rte                       Return from Exception

RTD         rtd                       Return and Deallocate
                                      Stack

RTR         rtr                       Return and Restore
                                      Condition Codes

RTS         rts                       Return from Subroutine

SBCD        sbcd.b    %dy,%dx         Subtract Decimal with
                      -(%ay),-(%ax)   Extend

Scc         sCC.b     EA              Set According to Condition

STOP        stop      &I              Load Status Register
                                      and Stop

SUB         sub.S     EA,%dn          Subtract Binary
                      %dn,EA

SUBA        sub.A     EA,%an          Subtract Address

SUBI        sub.S     &I,EA           Subtract Immediate

SUBQ        sub.S     &Q,EA           Subtract Quick

SUBX        subx.S    %dy,%dx         Subtract with Extend
                      -(%ay),-(%ax)

SWAP        swap.w    %dn             Swap Register Halves

TAS         tas.b     EA              Test and Set an Operand

TRAP        trap      &I              Trap

TRAPV       trapv                     Trap on Overflow

TST         tst.S     EA              Test an Operand

UNLK        unlk      %an             Unlink

Chapter 9
THE "curses" PACKAGE

INTRODUCTION 

The UNIX PC software development system includes two
different terminal virtualization packages, terminal access
method (tam) and curses. Each provides device independent
terminal input/output.

The tam package is recommended for programming on the
UNIX PC because it offers more capabilities than curses.
tam has the following features that are not available in
curses:

• The shared library feature of the UNIX PC is used, so 
programs written with tam can be significantly smaller 
than those written with curses. 

• Real, overlapping windows are supported. 

• Context sensitive help messages are supported. 

• Device independent input is supported. (curses supports
device independence only on output.)

• Menus, forms, and messages are supported. 

• Both high and low level mouse support routines are 
provided. 

• The most frequently used curses calls are emulated by 
tam to allow easy porting of code already written using 
curses. 

Programs previously written with curses can be ported using
the UNIX PC curses package.

The full curses package that is supported on the UNIX PC is
documented in the curses(3X) manual page. This chapter is an
introduction to curses(3X). It is intended for the programmer
who must write a screen-oriented program using the curses
package. This chapter also documents curses functions.

For curses to be able to produce terminal dependent output, it
has to know what kind of terminal you have. The UNIX system
convention for this is to put the name of the terminal in the
variable TERM in the environment. Thus, a user on a DEC
VT100 would set TERM=vt100 when logging in. Curses uses
this convention.



Output 

A program using curses always starts by calling
initscr(). (See Figure 9-1.) Other modes can then be set
as needed by the program. During the execution of the
program, output to the screen is done with routines such as
addch(ch) and printw(fmt, args). (These routines
behave just like putchar and printf except that they go
through curses.) The cursor can be moved with the call
move(row, col). These routines only output to a data
structure called a window, not to the actual screen. A window
is a representation of a CRT screen, containing such things as
an array of characters to be displayed on the screen, a cursor, a
current set of video attributes, and various modes and options.
You don't need to worry about windows unless you use more
than one of them, except to realize that a window is buffering
your requests to output to the screen.

To send all accumulated output, it is necessary to call
refresh(). (This can be thought of as a flush.) Finally,
before the program exits, it should call endwin(), which
restores all terminal settings and positions the cursor at the
bottom of the screen.



        #include <curses.h>

            initscr();      /* Initialization */

            raw();          /* Various optional mode settings */
            nonl();
            noecho();

            while (!done) { /* Main body of program */

                /* Sample calls to draw on screen */
                move(row, col);
                addch(ch);
                printw("Formatted print with value %d\n", value);

                /* Flush output */
                refresh();
            }

            endwin();       /* Clean up */
            exit(0);

Figure 9-1 — Framework of a Curses Program 



Some programs assume all screens are 24 lines by 80 columns. 
It is important to understand that many are not. The variables 
LINES and COLS are defined by initscr with the current 
screen size. Programs should use them instead of assuming a 
24x80 screen. 

No output to the terminal actually happens until refresh is
called. Instead, routines such as move and addch draw on a
window data structure called stdscr (standard screen).
Curses always keeps track of what is on the physical screen,
as well as what is in stdscr.

When refresh is called, curses compares the two screen
images and sends a stream of characters to the terminal that
will turn the current screen into what is desired. Curses
considers many different ways to do this, taking into account
the various capabilities of the terminal and similarities between
what is on the screen and what is desired. It usually outputs as
few characters as possible. This function is called cursor
optimization and is the source of the name of the curses
package.
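
The idea of comparing the two screen images can be illustrated with a much simplified C sketch. It is not how curses is implemented: real curses also weighs cursor motion and terminal capabilities, while this sketch only counts the character cells that differ. The function name update_screen and the flat row-major character arrays are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compare the image of the physical screen with the desired image,
 * both stored row by row in arrays of rows * cols characters.
 * Each differing cell stands for a character that would have to be
 * sent to the terminal; the physical image is brought up to date.
 * Returns the number of cells that changed. */
size_t update_screen(char *physical, const char *desired,
                     size_t rows, size_t cols)
{
    size_t sent = 0;

    assert(physical != NULL && desired != NULL);
    for (size_t i = 0; i < rows * cols; i++) {
        if (physical[i] != desired[i]) {
            physical[i] = desired[i];
            sent++;
        }
    }
    return sent;
}
```

Calling the function a second time with the same desired image returns 0, which reflects why a refresh of an unchanged screen produces no output.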

NOTE: Due to the hardware scrolling of terminals, 
writing to the lower righthand character position 
is impossible. 



Input 

Curses can do more than just draw on the screen. Functions
are also provided for input from the keyboard. The primary
function is getch(), which waits for the user to type a
character on the keyboard, and then returns that character.
This function is like getchar except that it goes through
curses. Its use is recommended for programs using the
raw() or noecho() options, since several terminal or
system dependent options become available that are not
possible with getchar. The routine getstr(str) can be
called, allowing input of an entire line, up to a newline. This
routine handles echoing and the erase and kill characters of the
user.

getstr

No matter what the setting of echo is, strings typed in here are 
echoed at the current cursor location. The user's erase and kill 
characters are understood and handled. This makes it 
unnecessary for an interactive program to deal with erase, kill, 
and echoing when the user is typing a line of text. 



Highlighting 

Characters can be written with the standout attribute. This 
attribute is used to make text attract the attention of the user. 
The particular hardware attribute used for standout varies 
from terminal to terminal, and is chosen to be the most visually 
pleasing attribute the terminal has. Standout is typically 
implemented as reverse video or bold. Many programs don't 
really need a specific attribute, such as bold or inverse video, 
but instead just need to highlight some text. Two functions,
standout() and standend(), turn this attribute on and off.



Multiple Windows 

A window is a data structure representing all or part of the 
CRT screen. It has room for a two dimensional array of 
characters, with a standout bit for each character (a total of 8 
bits per character: 7 for text and 1 for attribute), a cursor, a set 
of current attributes, and a number of flags. Curses provides a
full screen window, called stdscr, and a set of functions
that use stdscr. Another window is provided called
curscr, representing the physical screen.

It is important to understand that a window is only a data 
structure. Use of more than one window does not imply use of 
more than one terminal, nor does it involve more than one 
process. A window is merely an object which can be copied to 
all or part of the terminal screen. The current implementation 
of curses does not allow windows which are bigger than the 
screen. 

The programmer can create additional windows with the
function newwin(lines, cols, begin_row, begin_col).
This function returns a pointer to a newly
created window. The window will be lines by cols, and
the upper left corner of the window will be at screen position
(begin_row, begin_col). All operations that affect
stdscr have corresponding functions that affect an arbitrary
named window. Generally, these functions have names formed
by putting a "w" on the front of the stdscr function, and the
window name is added as the first parameter. Thus,
waddch(mywin, c) would write the character c to window
mywin. The wrefresh(win) function is used to flush the
contents of a window to the screen.

Windows are useful for maintaining several different screen
images, and alternating the user among them. Also, it is
possible to subdivide the screen into several windows,
refreshing each of them as desired. When windows overlap, the
contents of the screen will be the more recently refreshed
window.


In all cases, the non-w version of the function calls the w
version of the function, using stdscr as the additional
argument. Thus, a call to addch(c) results in a call to
waddch(stdscr, c).

The main display is kept in stdscr. When the user 
temporarily wants to put something else on the screen, a new 
window is created covering part of the screen. A call to 
wrefresh on that window causes the window to be written 
over stdscr on the screen. Calling refresh on stdscr 
results in the original window being redrawn on the screen. If 
you have trouble refreshing a new window which overlaps an 
old window, it may be necessary to call touchwin on the new 
window to get it completely written out. 

For convenience, a set of "move" functions is also provided for
most of the common functions. These result in a call to move
before the other function. For example, mvaddch(row,
col, c) is the same as move(row, col); addch(c).
Combinations, e.g. mvwaddch(win, row, col, c), also
exist.



LIST OF ROUTINES 

This section describes all the routines available to the
programmer in the curses package. The routines are
organized by function. For an alphabetical list, see
curses(3X).



Structure 

All programs using curses should include the file 
<curses.h>. This file defines several curses functions as 
macros, and defines several global variables and the datatype 
WINDOW. References to windows are always of type WINDOW 
*. Curses also defines WINDOW * constants stdscr (the 
standard screen, used as a default to routines expecting a 
window), and cursor (the current screen, used only for 
certain low level operations like clearing and redrawing a 
garbaged screen). Integer constants LINES and COLS are 
defined, containing the size of the screen. Constants TRUE 
and FALSE are defined, with values 1 and 0, respectively. 
Additional constants which are values returned from most 
curses functions are ERR and OK. OK is returned if the 
function could be properly completed, and ERR is returned if 
there was some error, such as moving the cursor outside of a 
window. 

The include file <curses.h> automatically includes
<stdio.h> and the tty driver interface file, <termio.h>.
Including <stdio.h> again is harmless but wasteful.

A program using curses should include the loader option
-lcurses in the makefile. This is true for both the termcap
level and the curses level.



Initialization 

These functions are called when initializing a program. 

initscr()

The first function called should always be initscr. This 
will determine the terminal type and initialize curses data 
structures. initscr also arranges that the first call to 
refresh will clear the screen. 



endwin()

A program should always call endwin before exiting. This 
function will restore tty modes, move the cursor to the lower 
left corner of the screen, reset the terminal into the proper 
non-visual mode, and tear down all appropriate data structures. 

longname(termbuf, name)

This function returns a pointer to a static area containing a
verbose description of the current terminal, after a call to
initscr.



Option Setting 

These functions set options within curses. In each case, win 
is the window affected, and bf is a boolean flag with value 
TRUE or FALSE indicating whether to enable or disable the 
option. All options are initially FALSE. It is not necessary 
to turn these options off before calling endwin.

clearok(win, bf)

If set, the next call to wrefresh with this window will clear 
the screen and redraw the entire screen. If win is curscr,
the next call to wrefresh with any window will cause the 
screen to be cleared. This is useful when the contents of the 
screen are uncertain, or in some cases for a more pleasing 
visual effect. 



leaveok(win, bf)

Normally, the hardware cursor is left at the location of the 
window cursor being refreshed. This option allows the cursor 
to be left wherever the update happens to leave it. It is useful 
for applications where the cursor is not used, since it reduces 
the need for cursor motions. If possible, the cursor is made 
invisible when this option is enabled. 

scrollok(win, bf)

This option controls what happens when the cursor of a window 
is moved off the edge of the window, either from a newline on 
the bottom line, or typing the last character of the last line. If 
disabled, the cursor is left on the bottom line. If enabled, 
wrefresh is called on the window, and then the physical 
terminal and window are scrolled up one line. Note that in 
order to get the physical scrolling effect on the terminal, it is 
also necessary to call idlok. 



Terminal Mode Setting 

These functions are used to set modes in the tty driver. The 
initial mode usually depends on the setting when the program 
was called: the initial modes documented here represent the 
normal situation.

echo()
noecho()

These functions control whether characters typed by the user
are echoed as typed. Initially, characters typed are echoed by
the teletype driver. Authors of many interactive programs
prefer to do their own echoing in a controlled area of the
screen, or not to echo at all, so they disable echoing.

nl()
nonl()

These functions control whether newline is translated into 
carriage return and linefeed on output, and whether return is 
translated into newline on input. Initially, the translations do 
occur. By disabling these translations, curses is able to make 
better use of the linefeed capability, resulting in faster cursor 
motion. 



raw()
noraw()

The terminal is placed into or out of raw mode. Raw mode is 
similar to cbreak mode in that characters typed are 
immediately passed through to the user program. The 
differences are that in RAW mode, the interrupt, quit, and 
suspend characters are passed through uninterpreted instead of 
generating a signal. RAW mode also causes 8 bit input and 
output. The behavior of the BREAK key may be different on 
different systems. 

resetty()
savetty()

These functions save and restore the state of the tty modes.
savetty saves the current state in a buffer; resetty
restores the state to what it was at the last call to savetty.



Window Manipulation

newwin(num_lines, num_cols, beg_row, beg_col)

Create a new window with the given number of lines and
columns. The upper left corner of the window is at line
beg_row, column beg_col. If either num_lines or
num_cols is zero, it defaults to LINES - beg_row or
COLS - beg_col, respectively. A new full-screen window is
created by calling newwin(0, 0, 0, 0).

subwin(orig, num_lines, num_cols, begy, begx)

Create a new window with the given number of lines and 
columns. The window is at position (begy, begx) on the screen. 
(It is relative to the screen, not orig.) The window is made 
in the middle of the window orig, so that changes made to 
one window will affect both windows. When using this 
function, often it will be necessary to call touchwin before 
calling wrefresh. 

delwin ( win ) 

Deletes the named window, freeing up all memory associated 
with it. In the case of overlapping windows, subwindows should 
be deleted before the main window. 



mvwin(win, br, bc)

Move the window so that the upper left corner will be at
position (br, bc). If the move would cause the window to
be off the screen, it is an error and the window is not moved. 

touchwin(win)

Throw away all optimization information about which parts of 
the window have been touched, by pretending the entire window 
has been drawn on. This is sometimes necessary when using 
overlapping windows, since a change to one window will affect 
the other window, but the records of which lines have been 
changed in the other window will not reflect the change.

overlay(win1, win2)
overwrite(win1, win2)

These functions overlay win1 on top of win2; that is, all
text in win1 is copied into win2. The difference is that
overlay is nondestructive (blanks are not copied) while
overwrite is destructive.



Causing Output to the Terminal 

refresh()
wrefresh(win)

These functions must be called to get any output on the
terminal, as other routines merely manipulate data structures.
wrefresh copies the named window to the physical terminal
screen, taking into account what is already there in order to do 
optimizations. refresh is the same, using stdscr as a 
default screen. Unless leaveok has been enabled, the physical 
cursor of the terminal is left at the location of the window's 
cursor. 



Writing on Window Structures 

These routines are used to "draw" text on windows. In all 
cases, a missing win is taken to be stdscr. y and x are 
the row and column, respectively. The upper left corner is 
always (0,0), not (1,1). The mv functions imply a call to move 
before the call to the other function. 



Moving the Cursor 

move(y, x)
wmove(win, y, x)

The cursor associated with the window is moved to the given
location. This does not move the physical cursor of the
terminal until refresh is called. The position specified is
relative to the upper left corner of the window.



Writing One Character

addch(ch)
waddch(win, ch)
mvaddch(y, x, ch)
mvwaddch(win, y, x, ch)

The character ch is put in the window at the current cursor 
position of the window. If ch is a tab, newline, or backspace, 
the cursor will be moved appropriately in the window. If ch is 
a different control character, it will be drawn in the ^X
notation. The position of the window cursor is advanced. At 
the right margin, an automatic newline is performed. At the 
bottom of the scrolling region, if scrollok is enabled, the 
scrolling region will be scrolled up one line. 



Writing a String 

addstr(str)
waddstr(win, str)
mvaddstr(y, x, str)
mvwaddstr(win, y, x, str)

These functions write all the characters of the null terminated
character string str on the given window. They are identical
to a series of calls to addch.



Clearing Areas of the Screen 

erase()
werase(win)

These functions copy blanks to every position in the window.

clear()
wclear(win)

These functions are like erase and werase but they also
call clearok, arranging that the screen will be cleared on
the next call to refresh for that window.

clrtobot()
wclrtobot(win)

All lines below the cursor in this window are erased. Also, the
current line to the right of the cursor is erased.

clrtoeol()
wclrtoeol(win)

The current line to the right of the cursor is erased.



Inserting and Deleting Text 

delch()
wdelch(win)
mvdelch(y, x)
mvwdelch(win, y, x)

The character under the cursor in the window is deleted. All
characters to the right on the same line are moved to the left
one position. This does not imply use of the hardware delete
character feature.

deleteln()
wdeleteln(win)

The line under the cursor in the window is deleted. All lines
below the current line are moved up one line. The bottom line
of the window is cleared. This does not imply use of the
hardware delete line feature.

insch(c)
winsch(win, c)
mvinsch(y, x, c)
mvwinsch(win, y, x, c)

The character c is inserted before the character under the
cursor. All characters to the right are moved one space to the
right, possibly losing the rightmost character on the line. This
does not imply use of the hardware insert character feature.

insertln()
winsertln(win)

A blank line is inserted above the current line. The bottom line
is lost. This does not imply use of the hardware insert line
feature.



Formatted Output 

printw(fmt, args)
wprintw(win, fmt, args)
mvprintw(y, x, fmt, args)
mvwprintw(win, y, x, fmt, args)

These functions correspond to printf. The characters which
would be output by printf are instead output using
waddch on the given window.



Miscellaneous 

box(win, vert, hor)

A box is drawn around the edge of the window. vert and
hor are the characters the box is to be drawn with.



scroll(win)

The window is scrolled up one line. This involves moving the 
lines in the window data structure. As an optimization, if the 
window is stdscr and the scrolling region is the entire 
window, the physical screen will be scrolled at the same time. 



Input from a Window 

getyx(win, y, x)

The cursor position of the window is placed in the two integer
variables y and x. Since this is a macro, no & is necessary.

inch()
winch(win)
mvinch(y, x)
mvwinch(win, y, x)

The character at the current position in the named window is
returned.



Input from the Terminal 

getch()
wgetch(win)
mvgetch(y, x)
mvwgetch(win, y, x)

A character is read from the terminal associated with the
window. The program will wait until the system passes text
through to the program. Depending on the setting of raw, this
will be after one character, or after the first newline.



getstr(str)
wgetstr(win, str)
mvgetstr(y, x, str)
mvwgetstr(win, y, x, str)

A series of calls to getch is made, until a newline is received.
The resulting value is placed in the area pointed at by the
character pointer str. The user's erase and kill characters
are interpreted.

scanw(fmt, args)
wscanw(win, fmt, args)
mvscanw(y, x, fmt, args)
mvwscanw(win, y, x, fmt, args)

These functions correspond to scanf. wgetstr is called on
the window, and the resulting line is used as input for the scan.



Video Attributes

standout()
standend()
wstandout(win)
wstandend(win)

The current attributes of a window are applied to all characters
that are written into the window with waddch. Attributes
are a property of the character, and move with the character
through any scrolling and insert/delete line/character
operations. To the extent possible on the particular terminal,
they will be displayed as the graphic rendition of characters
put on the screen.

standout() turns on highlighting for subsequent characters.

standend() turns off highlighting.



Lower Level Functions 

These functions are provided for programs not needing the 
screen optimization capabilities of curses. Programs are 
discouraged from working at this level, since they must handle 
various glitches in certain terminals. However, a program can 
be smaller if it only brings in the low level routines. 



Cursor Motion 

mvcur(oldrow, oldcol, newrow, newcol)

This routine optimally moves the cursor from (oldrow, oldcol)
to (newrow, newcol). The user program is expected to keep
track of the current cursor position. Note that unless a full
screen image is kept, curses will have to make pessimistic
assumptions, sometimes resulting in less than optimal cursor
motion. For example, moving the cursor a few spaces to the
right can be done by transmitting the characters being moved
over, but if curses does not have access to the screen image, it
doesn't know what these characters are.



Additional Terminals 

Curses will work even if absolute cursor addressing is not 
possible, as long as the cursor can be moved from any location 
to any other location. It considers local motions, parameterized 
motions, home, and carriage return. 

Curses is aimed at full duplex, alphanumeric, video terminals. 
No attempt is made to handle half-duplex, synchronous, hard 
copy, or bitmapped terminals. Bitmapped terminals can be 
handled by programming the bitmapped terminal to emulate an 
ordinary alphanumeric terminal or by using the tam(3) library.



Chapter 10
USING SHELL COMMANDS 

INTRODUCTION 

This chapter provides information to enhance use of the shell.
Most information should be useful to both the programmer and 
nonprogrammer alike. Some information may be of more use 
to the more advanced user. It is assumed that the user has 
been introduced to the UNIX system and understands such 
basics as how to log in, set the terminal baud rate, etc. 

EXECUTING SIMPLE SHELL COMMANDS 

A simple shell command consists of the command name 
possibly followed by some arguments such as 

cmd arg1 arg2 arg3 ...

where cmd is the command name consisting of a sequence of
letters, digits, or underscores beginning with a letter or
underscore. For example, the shell command

ls

prints a list of files in the current directory.

INPUT/OUTPUT REDIRECTION

Most commands produce output to a terminal. Output can be
redirected to a file in two different ways. First, standard
output may be redirected to a file by the notation ">"; thus

ls -l > tempfile

causes the shell to redirect the output of the command ls to be
put in tempfile. If there is no file tempfile, one is created by the
shell. Any previous contents of tempfile are destroyed.

Second, standard output may be appended to the end of a file by
the notation ">>"; thus

ls -l >> tempfile

causes the shell to append the output of the command ls to the
end of the contents of tempfile. If tempfile does not already
exist, it is created.

Although input is normally from a terminal, it can also be
redirected by the "<" notation. Thus

wc < tempfile

would send the contents of tempfile to the wc command which
would give a character, word, and line count of tempfile.
Another modification of input is possible with the "<<"
notation. The form

cmd << word

would send standard input to the specified command until a
line the same as word is input. As an example

sort << finished

would send all the standard input to sort until finished is
input. Then the input would be sorted and output to the
terminal. If the notation "<<-" is used, then all leading tabs
would be stripped. As an example, the following is entered at
the terminal (note that the primary system prompt $ and the
secondary system prompt > provided by the system are shown
in this example):

$ sort <<end
> no one does anything about it
> everyone talks about the weather but
> end

and the following would be returned: 

everyone talks about the weather but 
no one does anything about it 
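
The redirection forms described in this section can be tried together
in a short procedure. The following is only a sketch: the scratch
directory name and the variable names (dir, lines, sorted) are
invented for illustration and do not come from the system itself.

```shell
# Work in a scratch directory so that no existing file is disturbed.
dir=/tmp/redir$$
mkdir "$dir"

echo first  > "$dir/tempfile"     # creates tempfile
echo second > "$dir/tempfile"     # previous contents are destroyed
echo third >> "$dir/tempfile"     # appended to the end of the file

lines=`wc -l < "$dir/tempfile"`   # wc reads tempfile as standard input

# Here document: sort reads the lines up to the line "end".
sorted=`sort <<end
no one does anything about it
everyone talks about the weather but
end
`

rm -r "$dir"
```

After this runs, lines holds a count of 2 (the first echo was
overwritten), and sorted holds the two input lines in sorted order.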



PIPELINES AND FILTERS 

The standard output of one command may be connected to the
standard input of another by using the pipe (|) operator
between commands as in

ls -l | wc

A sequence of one or more commands connected in this way
constitutes a pipeline, and the overall effect is the same as

ls -l > file; wc < file

except no file is used. Instead the two processes are connected
together by a pipe [see pipe(2)] and are run in parallel. Each
command is run as a separate process.

Pipes allow one to connect several commands from left to
right, with the standard output from each command
becoming the standard input of the next command. This
avoids creating temporary files and is usually faster than
using them. Pipes are unidirectional. Synchronization is
achieved by halting wc when there is nothing to read and
halting ls when the pipe is full.

A filter is a command that reads its standard input, transforms
it in some way, and prints the result as output. One such filter,
grep(1), selects from its input those lines that contain some
specified string. For example,

ls | grep old

prints those lines that contain the string "old". Another filter
is the sort(1) command that gives alphabetical listings.
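
A pipeline may contain more than two commands. The sketch below
feeds three made-up names through grep and then counts the selected
lines; the names and the variables matches and count are
illustrative only.

```shell
# Three names are written to the pipe; grep keeps those containing "old".
matches=`(echo new.c; echo old.c; echo older.c) | grep old`

# A third stage counts the lines that grep selected.
count=`(echo new.c; echo old.c; echo older.c) | grep old | wc -l`
```

Each stage is a filter in the sense described above: it reads its
standard input, transforms it, and writes the result for the next
stage.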



PERMISSION MODES 

All UNIX system files have three independent attributes (often 
called "permissions"), read, write, and execute (rwx). These 
three permissions are assigned to three different levels of users. 
The first level is the owner level. Normally, the creator of the 
file is the owner. This ownership can be changed with the 
chown(1) command. The second level is the group level. The
third level is the others level. The permission for each level
must be set to allow reading, writing, or executing a file.

The ls command will display among other things the
permissions for a file when used as follows

ls -l filename

The general format of the permissions is

-rwxrwxrwx 

where the first character will be a dash if it is an ordinary file. 
The second, third and fourth characters (the first rwx) 
indicate the permission modes for the owner. The fifth, sixth, 
and seventh characters (the second rwx) indicate the 
permission modes of the group. And the eighth, ninth, and 
tenth characters (the last rwx) indicate the permission modes 
of others. A dash in any permission mode position indicates 
that the mode is not allowed. 

For example, the input 

ls -l wg

displays the permissions of wg as follows:

-rwxr-x--- 1 abc UNIX 66 May 4 09:25 wg

In this case, the owner has read (r), write (w), and execute (x) 
permission, the group has read and execute permission, and all 
others are denied (-) permission to wg. 

The chmod(1) command is used by the owner to change the
permission modes of a file. To change the permissions of wg so
that everyone could execute the procedure, enter the following
command:

chmod 751 wg

which would result in a permission mode of rwxr-x--x. The 7
assigns the owner read, write, and execute permission [4 (read)
+ 2 (write) + 1 (execute) = 7]. The 5 assigns the group read
and execute permission [4 (read) + 1 (execute) = 5]. The 1
assigns others execute permission.

The chmod command could also be entered as 

chmod +x wg 

which would add execute permission for owner, group, and all 
others. 
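
The effect of the numeric modes can be checked by listing a file
before and after chmod. This is a minimal sketch that uses a scratch
copy of wg in /tmp rather than a real procedure; the file name and
the variables f, before, and after are assumptions made for the
example.

```shell
f=/tmp/wg$$                         # scratch file standing in for wg
echo 'who | grep $1' > "$f"

chmod 750 "$f"                      # 7 = 4+2+1, 5 = 4+1, 0 = none
before=`ls -l "$f" | cut -c1-10`    # -rwxr-x---

chmod 751 "$f"                      # others gain execute (1)
after=`ls -l "$f" | cut -c1-10`     # -rwxr-x--x

rm -f "$f"
```

Only the first ten characters of the ls -l line are kept, since the
rest of the line (links, owner, size, date) varies from system to
system.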

FILE NAME GENERATION 

The shell provides a mechanism for generating a list of file 
names that match a pattern. For example,

ls -l *.c

generates as arguments to ls(1) all file names in the current
directory that end in .c. The character "*" is a pattern that 
will match any string including the null string. In general, 
patterns are specified as follows: 



*      Matches any string of characters
       including the null string.

?      Matches any single character.

[...]  Matches any character enclosed. A pair
       of characters separated by a minus will
       match any character lexically between the
       pair.

For example,

ls -l [a-z]*

matches all names in the current directory beginning with
letters a through z. The input

ls -l /usr/fred/test/?

matches all names in the directory /usr/fred/test that consist 
of a single character. This mechanism is useful both to save 
typing and to select names according to some pattern. 
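
The patterns can be tried safely in an empty scratch directory. In
the sketch below the directory and the file names (a.c, b.c, notes,
x) are invented for the example; echo is used so that the generated
argument list itself is visible.

```shell
here=`pwd`
dir=/tmp/glob$$
mkdir "$dir"
cd "$dir"
: > a.c; : > b.c; : > notes; : > x    # create four empty files

c_files=`echo *.c`      # a.c b.c
single=`echo ?`         # x
lower=`echo [a-b]*`     # a.c b.c
all=`echo *`            # a.c b.c notes x

cd "$here"
rm -r "$dir"
```

The shell expands each pattern into a sorted list of matching names
before echo is run, so echo simply prints its arguments.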

There is one exception to the general rules given for patterns.
The character "." at the start of a file name must be explicitly
matched. The input

echo *

prints all file names in the current directory not beginning with
".". The input

echo .*

prints all those file names that begin with ".". This avoids
inadvertently matching the names "." and ".." that mean "the
current directory" and "the parent directory," respectively.
[Notice that ls(1) suppresses information for the files "." and
"..".]



QUOTING

Characters that have a special meaning to the shell, such as

< > * ? | & $ ; \ " " [ ]

are called metacharacters.

The shell can be inhibited from interpreting and acting upon 
the special meaning assigned metacharacters by preceding them 
with a backslash (\). Any character preceded by a \ loses its 
special meaning. For example 

echo * 



prints all the file names in the current directory. To echo an
asterisk, enter



echo \* 

The backslash turns off any special meaning of a 
metacharacter. 

To allow long strings to be continued over more than one line, 
the sequence \newline (or RETURN) is ignored. The \ is 
convenient for quoting single characters. When more than one 
character needs quoting, the above mechanism is clumsy and 
error prone. A string of characters may be quoted by enclosing 
the string between single quotes. All characters enclosed 
between a pair of single quote marks are quoted except for a 
single quote. For example, 

echo xx'****'xx

will print

xx****xx

The quoted string may not contain a single quote but may 
contain new lines that are preserved. This quoting mechanism 
is the simplest and is recommended for casual use. 
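
The two quoting mechanisms can be compared side by side. In this
sketch the variable names star, quoted, and joined are made up for
the example; each one captures what the corresponding echo prints.

```shell
star=`echo \*`              # backslash quotes one character: *
quoted=`echo xx'****'xx`    # single quotes protect a string: xx****xx
joined=`echo one \
two`                        # \newline is ignored: one two
```

In each case the metacharacters reach echo as ordinary characters,
so no file name generation takes place.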



EXECUTING COMMANDS IN THE 
BACKGROUND 

To execute a command, the shell normally creates a new 
process and waits for it to finish. A command may be run 
without waiting for it to finish. Executing commands in the 
background enables the terminal to be used for other tasks. 
Adding an ampersand (&) at the end of a command line before 
the RETURN starts the execution of a command and 
immediately returns to the shell command level. For example, 

cc pgm.c & 

calls the C compiler to compile the file pgm.c. The trailing "&" 
is an operator that instructs the shell not to wait for the 
command to finish. To help keep track of such a process, the 
shell reports its process number following its creation. This 
means the system will respond with a process number followed 
by the primary shell prompt. 



Determining Completion of Background Commands 

When a command is executed in the background, a prompt is 
not received when the command completes execution. The only 
way to see that the command is either in process or complete is 
to request process status. The status of all active processes 
assigned to a user can be reported as follows 



ps -u ulist

where "ulist" is the login name. If the process number and
associated command name are output by the ps command, then 
the command is running in the background. If the process 
number and associated command name are not output by the 
ps command, then the command has finished executing. 



Terminating Background Commands 

Once a command starts in the background, it will run until it is 
finished or is stopped. The BREAK, RUBOUT, DELETE, or 
other keys will not stop a command running in the background. 
Instead, the process must be "killed" with the kill(1) command
as follows: 



kill PID 

where "PID" is the process identification number. The shell
variable $! contains the PID of the last process run in the 
background and can be obtained as follows: 

echo $! 

All nonessential background processes can be stopped by 
executing the following command: 



kill 0



Some processes can ignore the software termination signal. To 
stop these processes, enter the following: 

kill -9 PID 
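
The steps above can be combined in a procedure. The following
sketch starts a harmless background command, saves its process
number from $!, and stops it. The use of sleep as the background
command, the wait command used to collect the exit status, and the
variable names pid and status are choices made for this example.

```shell
sleep 30 &               # run in the background; the shell does not wait
pid=$!                   # process number of the last background command

status=0
kill $pid                # send the software termination signal
wait $pid || status=$?   # nonzero: sleep was stopped by the signal
```

Saving $! immediately is a good habit, since its value changes as
soon as another command is started in the background.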

A process running in the background is automatically killed 
when the user logs out. The nohup(1) command can be used to
continue the process after logging off or hanging up. For
example,

nohup nroff text &

would continue the formatting of the file text using the
nroff(1) formatter even if one logged off or the telephone line
to the computer went down. The system responds with the
lines: 

28096 

$ Sending output to nohup.out 

The 28096 is the process ID number. A file nohup.out is 
created by the nohup command, and all output of the process 
is directed to this file. To redirect the output to a particular
file, use output redirection before the ampersand as follows:

nohup nroff text > formatted &

to direct the output to the file formatted.



SHELL VARIABLES 

A variable is a name representing a string value. (Loosely 
defined, a string is a combination of one or more alphanumeric 
characters or symbols.) Variables that are normally set on a 
command line are called parameters. There are two types of 
parameters in the shell: positional and keyword.



Positional Parameters 

When a shell procedure is invoked, the shell implicitly creates 
positional parameters. The shell assigns the positional 
parameters as follows: 

${0} ${1} ${2} ${3} ... ${9}

Since the general form of a simple command is

cmd arg1 arg2 arg3 ...

then the values of the positional parameters are

cmd   arg1  arg2  arg3  ...  arg9
${0}  ${1}  ${2}  ${3}  ...  ${9}

For instance, if the following command is entered

cmd temp1 temp2 temp3

then the positional parameter ${1} would have the value
temp1. Notice that the command procedure name is always
assigned to ${0}.

The positional parameters are used often in shell programs. If 
a shell program, wg, contained 

who | grep $1

then the call to run the program

sh wg fred

is equivalent to

who | grep fred

The variable $* is a special shell parameter used to substitute 
for all positional parameters except $0. Certain other similar 
variables are used by the shell. The following are set by the 
shell:

$?  The exit status (return code) of the last command
    executed as a decimal string. Most commands return
    a zero exit status if they complete successfully;
    otherwise, a nonzero exit status is returned. Testing
    the value of return codes is dealt with later under if
    and while commands.

$#  The number of positional parameters in decimal.

$$  The process number of this shell in decimal. Since
    each process number is unique among existing
    processes, this string is frequently used to generate
    temporary file names. For example,

    ps -a >/tmp/ps$$
    rm /tmp/ps$$

$!  The process number of the last process run in the
    background (in decimal).

$-  The current shell flags, such as -x and -v.
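
The special parameters can be examined without writing a separate
procedure by using the set command, which assigns new positional
parameters in the current shell. The sketch below assumes that use
of set; the variable names first, count, all, ok, bad, and tmpfile
are invented for the example.

```shell
set temp1 temp2 temp3    # as if the procedure had been called with 3 arguments

first=$1                 # temp1
count=$#                 # 3
all="$*"                 # temp1 temp2 temp3

true
ok=$?                    # 0: the command completed successfully
bad=0
false || bad=$?          # nonzero: the command failed

tmpfile=/tmp/ps$$        # a temporary file name unique to this shell
```

Note that $? must be saved immediately after the command of
interest, because every later command replaces its value.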



Keyword Parameters 

The shell uses certain variables known as keyword parameters 
for specific purposes. The following variables are discussed in 
this portion of the document: 

HOME
PATH
CDPATH
MAIL
PS1
PS2
IFS
SHELL.



HOME



The variable HOME is used by the shell as the default value 
for the cd(l) command. Entering 

cd 

is equivalent to entering 

cd $HOME 

where the value of HOME is substituted by the shell. If
$HOME=/d3/abc/def, then each of the above two entries would
be equivalent to

cd /d3/abc/def

Normally, HOME is initialized by login(1) to the login
directory. The value of HOME can be changed to /d3/abc/ghi
by entering the following

HOME=/d3/abc/ghi

No spaces are permitted. The change of the variable will have 
no effect unless the value is exported [see export in Chapter 
11 under "Special Commands" and in sh(l)]. All variables 
(with their associated values) that are known to a command at 
the beginning of execution of that command constitute its 
environment. To change the environment to a new variable 
setting, the following must be entered: 

export variable-name 

For instance, if HOME has been modified, then the command

export HOME

will cause the environment to be modified accordingly. The 
variable HOME need be exported only once. At login the next 
time, the original variable settings will be reestablished. A 
change to the .profile would modify the environment for each 
new login. 
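
The difference between setting a variable and exporting it can be
seen by asking a new shell what it received in its environment. In
this sketch, color_demo is a made-up variable name chosen so that it
does not disturb HOME or any other variable the system uses.

```shell
color_demo=blue                          # set, but not yet exported
unexported=`sh -c 'echo "$color_demo"'`  # the new shell sees nothing

export color_demo                        # place it in the environment
exported=`sh -c 'echo "$color_demo"'`    # the new shell now sees blue
```

The single quotes keep the current shell from substituting
$color_demo itself, so the value printed is the one known to the
new shell.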



PATH 

The variable PATH is used by the shell to specify the
directories to be searched to find commands. Each directory
entry in the PATH variable is separated by a colon (:). Several
directories can be specified in the PATH variable, but each
directory searched before the command is found consumes
processor time. The directories that contain the most often
used commands should therefore be specified first to reduce
searching time. The following is the default PATH value:

PATH=:/bin:/usr/bin

Since no value precedes the first :, the current directory is
the first directory searched. Then directory /bin is searched
followed by /usr/bin. To change the PATH variable, simply
enter PATH= followed by the directories to be searched. Each
directory should be separated by a colon. As when changing all
variables, no spaces are allowed before or after the =.



CDPATH 

The variable CDPATH specifies where the shell is to look 
when it is searching for the argument of the cd command if 
that argument is not null and does not begin with ../, ./, or /. 
For example, if the CDPATH variable were 

CDPATH=:/d3/abc/def:/d3/abc

then the command

cd ghi

would cause the current directory, /d3/abc/def directory, and
/d3/abc directory to be searched for the subdirectory ghi. If
found in the /d3/abc/def directory, the full pathname of the
subdirectory would be printed and the current working
directory would be changed to /d3/abc/def/ghi.
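
The search can be reproduced with a scratch directory tree. The
sketch below is illustrative only: the tree under /tmp, the
subdirectory name ghi_demo, and the variables here, top, and now
are assumptions made for the example.

```shell
here=`pwd`
top=/tmp/cdp$$                    # scratch tree standing in for the example tree
mkdir "$top" "$top/def" "$top/def/ghi_demo"

CDPATH=:$top/def:$top
cd ghi_demo > /dev/null           # found through CDPATH; the printed name is discarded
now=`pwd`                         # $top/def/ghi_demo

cd "$here"                        # return and clean up
CDPATH=
rm -r "$top"
```

Because ghi_demo does not exist in the starting directory, the
shell moves on to the next CDPATH entry and finds it there.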

MAIL 

The shell looks at the file specified by the MAIL variable and 
informs the user if there are any modifications. 



PS1

The variable PS1 is used by the shell to specify the primary
shell prompt. This is displayed at a terminal whenever the
shell is awaiting a command input. The default primary
prompt is $. To change the prompt to <>, for example, the
following is entered:

PS1="<>"



PS2 

The variable PS2 is used by the shell to specify the secondary 
shell prompt. This is displayed whenever the shell receives a 
newline in its input but more is expected. The default value of 
PS2 is >. To change the prompt to <more>, for example, the 
following is entered: 

PS2="<more>" 



IFS 

The variable IFS is used by the shell to specify the internal 
field separators. Normally, the space, tab, and newline 
characters are used. After parameter and command 
substitution, internal field separators are used to split the 
results of substitution into distinct arguments where such 
characters are found. Explicit null arguments ("" and '') are 
retained. 
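
As a small illustration of field splitting, the sketch below changes IFS to a colon so that a colon-separated string is split into separate arguments. The sample string and the variable names are assumptions made for the example.

```shell
# Split a colon-separated string into words by changing IFS.
list=/bin:/usr/bin:/usr/local/bin

saved_ifs=$IFS
IFS=:
set $list            # unquoted, so the value is split at each colon
IFS=$saved_ifs       # restore the normal separators

count=$#
second=$2
echo "$count fields, second is $second"
```

Saving and restoring IFS keeps the change from affecting later commands in the same shell.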



User Defined Variables 

A user variable can be defined using an assignment statement 
of the form name=value. The name must begin with a letter or 
underscore and may then consist of any sequence of letters, 
digits, or underscores. A positional parameter cannot be used 
as the name. 

The shell provides string-valued variables. Variables may be 
given values by entering 

user=fred box=m000 acct=mh000 

to assign values to the variables user, box, and acct. A variable 
may be set to the null string by entering 



null= 



The value of a variable is substituted by preceding its name 
with $. For example, 

echo $user 

will print fred. 

Variables may be used interactively to provide abbreviations 
for frequently used strings. For example, 

b=/usr/fred/bin 
mv file $b 

moves the file from the current directory to the directory 
/usr/fred/bin. A more general notation is available for 
parameter (or variable) substitution as in 

echo ${user} 

This is equivalent to 

echo $user 

and is used when the parameter name is followed by a letter or 
digit. For example, 

tmp=/tmp/ps 
ps a >${tmp}a 

directs the output of ps(1) to the file /tmp/psa, whereas 

ps a >$tmpa 

causes the value of the variable tmpa to be substituted. 
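
The difference can be checked directly. In this sketch the variable tmpa and its value are assumptions added so that both forms produce visible output.

```shell
tmp=/tmp/ps
tmpa=other             # a different variable whose name happens to start with tmp

with_braces=${tmp}a    # value of tmp followed by the letter a
without=$tmpa          # value of the variable tmpa

echo "$with_braces"
echo "$without"
```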

SPECIAL COMMANDS 

The following special commands are used in writing shell 
procedures. Many of the commands are only needed when 
programming. Others have nonprogramming uses. 





break       exit        readonly    trap 
cd          export      return      type 
continue    hash        set         ulimit 
echo        newgrp      shift       umask 
eval        pwd         test        unset 
exec        read        times       wait 



The ones that are useful to the casual (nonprogramming) user 
are described below. 



cd 

The cd command is used to change the current working 
directory as follows: 

cd [arg] 

where arg specifies the new directory desired. For instance, 

cd /d3/abc/ghi 

moves the user from anywhere in the file system to the 
directory /d3/abc/ghi. The full directory pathname must be 
specified to be used in this way. Execute permission must be 
set on the desired directory. 

If only the desired directory name is specified and the 
CDPATH variable is not set, then the current directory is 
searched for a subdirectory by that name. For instance, if the 
current directory /d3/abc contains a subdirectory subdir, then 
the command 

cd subdir 

changes the current working directory to /d3/abc/subdir. If the 
argument begins with ../, the current working directory is 
changed relative to its parent directory. If the argument begins 
with ./, the new directory is taken relative to the current 
directory. For instance, if the current working directory is 
/d3/abc, the following command: 

cd ./ghi 

changes the current directory to /d3/abc/ghi. 

If the variable CDPATH is set, the shell searches each 
directory specified in CDPATH for the directory specified by 
the cd command. If the directory is present, the directory 
becomes the new working directory. (See "CDPATH" under 
"Keyword Parameters.") 



exec 

The command 

exec [arg ...] 

causes the command specified by arg to be executed in place of 
the shell without creating a new process. Input/output 
arguments may appear and, if no other arguments are given, 
cause the shell input/output to be modified. 



newgrp 

By issuing the command newgrp(1), the user is assigned a new 
group identification. The command is of the form 



newgrp [-] [group] 

All access permissions are then evaluated with the new group. 
This allows access to files with different group ID permissions. 

Entering newgrp with no argument changes the group 
identification back to the original group. When a - is entered, 
the environment is changed to the login environment. 



pwd 

The pwd command prints the full pathname of the current 
working directory. This command is especially useful when 
working directories are changed often. 



set 

The set command provides the capability of altering several 
aspects of the behavior of the shell by setting certain shell 
flags. Some of the more useful flags for the nonprogrammer 
and their meanings are: 

-a    Mark variables that are modified or created for 
      export. 

-f    Disable file name generation. 

-v    Print lines as they are read by the shell. The 
      commands on each input line are executed after that 
      input line is printed. 

-x    Print commands and their arguments as they are 
      executed. This causes a trace of only those 
      commands that are actually executed. 

To set the x flag, for example, enter 

set -x 

To turn the x flag off, enter 

set +x 

These commands are especially useful for troubleshooting 
within shell procedures. 

The set command entered with no arguments will display the 
values of variables in the environment. 

ulimit 

The ulimit command has the form 

ulimit [-f] [n] 

When the option -f is used or if no option is specified, this 
command imposes a limit of n blocks on the size of files written 
by the shell and its child processes. Files of any size may be 
read. If n is omitted, the current value of this limit is printed. 
The default value for n varies from one installation to another. 

umask 

The umask command has the form 

umask [nnn] 

The user file creation mask is set to nnn. This mask is used to 
determine the permission modes set on a file when it is created. 
For instance, 

umask 033 

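withholds write and execute permission from group and others, 
so a file created with a requested mode of 777 is assigned the 
permission set of 744, and one created with a requested mode of 
666 is assigned 644. (See "PERMISSION MODES.") 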
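
The effect of the mask can be observed by creating a file and a directory and listing them. This is a sketch only; the scratch directory name is an assumption, and the modes shown by ls depend on the mode each creating program requests (the shell requests 666 for a redirected file, mkdir requests 777).

```shell
work=${TMPDIR:-/tmp}/umaskdemo.$$
mkdir "$work"
old_mask=`umask`

umask 033
: > "$work/plain"        # file created by the shell through redirection
mkdir "$work/dir"        # directory created by mkdir

# Keep only the ten permission characters from the long listing.
file_mode=`ls -ld "$work/plain" | cut -c1-10`
dir_mode=`ls -ld "$work/dir" | cut -c1-10`
echo "$file_mode"
echo "$dir_mode"

umask $old_mask          # restore the previous mask
rm -r "$work"
```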



RESTRICTED SHELL 

A restricted shell, rsh, is also available with the UNIX system. 
This restricted version of the shell is used to create an 
environment that controls and limits the user's capabilities. The 
actions of rsh are identical to those of sh, except that the 
following are disallowed: 

• Changing directory 

• Setting the value of PATH variable 

• Specifying path or command names containing / 

• Redirecting output (> and >>). 

The system administrator often sets up a directory of 
commands that can be safely invoked by rsh. A restricted 
editor may also be provided. 



Chapter 11 
SHELL PROGRAMMING 

INTRODUCTION 

This chapter describes shell as a programming language and 
builds upon the information provided in Chapter 10. It is 
expected that the reader has read Chapter 10 and has 
experience with UNIX system commands. 

INVOKING THE SHELL 

The shell is an ordinary command and may be invoked in the 
same way as other commands: 

sh proc [ arg ... ]       A new instance of the shell is 
                          explicitly invoked to read proc. 

sh -v proc [ arg ... ]    This is equivalent to putting set 
                          -v at the beginning of proc. 
                          Similarly for other set flags, 
                          including the x, e, u, and n flags. 

proc [ arg ... ]          If proc is marked executable, and 
                          is not a compiled, executable 
                          program, the effect is similar to 
                          that of the sh proc [ arg ... ] 
                          command. An advantage of this 
                          form is that proc may be found 
                          by the search procedure. 

INPUT/OUTPUT 

Unless redirected inside the program, the commands in a shell 
program use the input and output connections of the shell 
program itself. A redirection on a command changes the 
connections for that command only. 



Single Line 

The following could be used to print a line from a program: 

echo The date is: 
date 

and would result in: 

The date is: 

Tue May 21 16:13:38 EDT 1984 

Printing Error Messages 

Normally, error messages are associated with file descriptor 2 
and are sent to standard error. Error messages can be 
redirected to a file with the following command: 

sample 2>ERROR 

If an error message is produced when running the program 
sample, the error output is redirected to the file ERROR. 

Multiline Input (Here Documents) 

One way to input several lines to programs is with what is 
referred to as "Here Documents." The general form is 

cmd arg1 arg2 ... <<word 

where everything entered after this command is accepted as its 
input until word is entered on a line by itself. For example, 

sort <<finish 

sends all the standard input to sort until finish is entered. 
Then the input is sorted and output to the terminal. For 
example, 

$ sort <<finish 
> def 
> abc 
> finish 
abc 
def 

Note that the primary system prompt ($) and the secondary 
system prompt (>) are shown. The final two lines are returned 
by the system. 

The command 

sort <<-word 

strips leading tabs from each input line and from the line 
containing word. 
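
Inside a shell procedure the same form supplies input without a separate data file. The sketch below captures the sorted result in a variable; the variable name sorted and the sample lines are assumptions made for the illustration.

```shell
# Feed three lines to sort from the procedure itself.
# The line containing only "finish" ends the input.
sorted=`sort <<finish
def
abc
ghi
finish
`
echo "$sorted"
```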



SHELL VARIABLES 

The shell has several mechanisms for creating variables. A 
variable is a name representing a string value. Certain 
variables are usually referred to as parameters. Parameters 
are the variables normally set only on a command line. There 
are also positional parameters and keyword parameters. Other 
variables are simply names to which the user or the shell itself 
may assign string values. 

Positional Parameters: When a shell procedure is invoked, 
the shell implicitly creates positional parameters. The 
argument in position zero on the command line (the name of 
the shell procedure itself) is called $0, the first argument is 
called $1, and so on. The shift command may be used to access 
arguments in positions numbered higher than nine. 

One can explicitly force values into these positional parameters 
by using the set command 

set abc def ghi 

which assigns "abc" to the first positional parameter ($1), 
"def" to the second ($2), and "ghi" to the third ($3). For this 
example, set also unsets $4, $5, etc. even if they were 
previously set. Positional parameter $0 may not be assigned a 
value so that it always refers to the name of the shell 
procedure or to the name of the shell (in the login shell). 

For instance, 

set abc def ghi 
echo $3 $2 $1 

prints 

ghi def abc 
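
The interaction between set and shift can be sketched as follows. The eleven single-letter arguments are an assumption chosen so that there are more than nine of them.

```shell
set a b c d e f g h i j k    # eleven positional parameters
count=$#
first=$1

shift; shift                 # discard a and b; c is now $1
after_shift=$1
remaining=$#
ninth=$9                     # originally the eleventh argument

echo "$count $first $after_shift $remaining $ninth"
```

After the two shifts, the arguments that were in positions ten and eleven can be reached as $8 and $9.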

User-defined Variables: The shell also recognizes 
alphanumeric variables to which string values may be assigned. 
Positional parameters may not appear on the left-hand side of 
an assignment statement. Positional parameters can only be set 
as described in "Positional Parameters." A simple assignment 
is of the form 

name=string 

Thereafter, $name yields the value "string". A name is a 
sequence of letters, digits, and underscores that begins with a 
letter or an underscore. Note that no spaces surround the = in 
an assignment statement. 

More than one assignment may appear in an assignment 
statement, but beware since the shell performs the assignments 
from right to left. The following command line results in the 
variable a acquiring the value "abc": 

a=$b b=abc 

The following are examples of simple assignments. Double 
quotes around the right-hand side allow blanks, tabs, 
semicolons, and newlines to be included in "string", while also 
allowing variable substitution (also known as parameter 
substitution) to occur. In parameter substitution, references to 
positional parameters and other variable names that are 
prefaced by $ are replaced by the corresponding values, if any. 
Single quotes inhibit variable substitution. Some examples 
follow: 

MAIL=/usr/mail/gas 
var="$1 $2 $3 $4" 
stars=***** 
asterisks='$stars' 

The variable var has as its value the string consisting of the 
values of the first four positional parameters, separated by 
blanks. No quotes are needed around the string of asterisks 
being assigned to stars because pattern matching (expansion 
of *, ?, [...]) does not apply in this context. Note that the value 
of $asterisks is the literal string "$stars", not the string 
"*****", because the single quotes inhibit substitution. 

In assignments, blanks are not reinterpreted after variable 
substitution, so that the following example results in $first 
and $second having the same value: 

first='a string with embedded blanks' 
second=$first 

In accessing the value of a variable, one may enclose the 
variable's name (or the digit designating the positional 
parameter) in braces {} to delimit the variable name from any 
following string. In particular, if the character immediately 
following the name is a letter, digit, or underscore (digit only 
for positional parameters), then the braces are required. For 
example, 

a='This is a string' 
echo "${a}ent test" 

returns the following message 

This is a stringent test 

Command Substitution: Any command line can be placed 
within grave accents (`...`) to capture the output of the 
command. This concept is known as command substitution. 
The command or commands enclosed between grave accents are 
first executed by the shell and then their output replaces the 
whole expression, grave accents and all. This feature is often 
combined with shell variables so that 

today=`date` 

assigns the string representing the current date to the variable 
today (e.g., Tue Nov 27 16:01:09 EST 1984). The command 

users=`who | wc -l` 

saves the number of logged-in users in the variable users. Any 
command that writes to the standard output can be enclosed in 
grave accents. Grave accents may be nested; the inside sets 
must be escaped with \. For example, 

logmsg=`echo Your login directory is \`pwd\`` 
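
Both forms can be tried with commands whose output does not depend on the system. In this sketch the variable names words and msg and the echoed strings are assumptions made for the illustration.

```shell
# Capture the output of a pipeline; wc -w counts the words.
words=`echo one two three | wc -w`

# Nested grave accents: the inner pair is escaped with backslashes.
msg=`echo outer sees \`echo inner\``

echo $words
echo "$msg"
```

The first echo leaves $words unquoted so that any blanks that wc places before the number are dropped.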

Shell variables can also be given values indirectly by using the 
shell built-in command read. The read command takes a line 
from the standard input (usually the terminal) and assigns 
consecutive words on that line to any variables named. For 
example, 

read first init last 

will take an input line of the form: 

A. A. Smith 

and has the same effect as if 

first=A. init=A. last=Smith 

had been typed. 

The read command assigns any excess "words" to the last 
variable. 
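
The handling of excess words can be seen by giving read a line with four words. This sketch supplies the line from a here document instead of the terminal; the extra word "Jr." is an assumption added for the illustration.

```shell
# Three variables, four words: the last variable receives the remainder.
read first init last <<end
A. A. Smith Jr.
end

echo "first=$first init=$init last=$last"
```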

Predefined Special Variables: Several variables have special 
meanings. The following are set only by the shell: 

$#   records the number of positional arguments passed 
     to the shell, not counting the name of the shell 
     procedure itself. The variable $# yields the number 
     of the highest-numbered positional parameter that is 
     set. Thus, sh x a b c sets $# to 3. One of its 
     primary uses is in checking for the presence of the 
     required number of arguments: 

     if test $# -lt 2 
     then 
         echo 'two or more args required'; exit 
     fi 

$?   is the exit status (also referred to as return code, 
     exit code, or value) of the last command executed. 
     Its value is a decimal string. Most UNIX system 
     commands return 0 to indicate successful 
     completion. The shell itself returns the current 
     value of $? as its exit status. 

$$   is the process number of the current process. Since 
     process numbers are unique among all existing 
     processes, this string of up to five digits is often used 
     to generate unique names for temporary files. The 
     UNIX system provides no mechanism for the 
     automatic creation and deletion of temporary files. 
     A file exists until it is explicitly removed. 
     Temporary files are generally undesirable. The 
     UNIX system pipe mechanism is far superior for 
     many applications. However, the need for 
     uniquely-named temporary files does occasionally 
     occur. The following example also illustrates the 
     recommended practice of creating temporary files in 
     a directory used only for that purpose: 

     temp=$HOME/temp/$$ 
     ls > $temp 
     # commands, some of which use $temp, go here 
     rm $temp 

$!   is the process number of the last process run in the 
     background. Again, this is a string of up to five 
     digits. 

$-   is a string consisting of names of execution flags 
     currently turned on in the shell. For example, $- 
     has the value xv when the x and v tracing flags are 
     both turned on. 



CONDITIONAL SUBSTITUTION 

Normally, the shell replaces occurrences of $variable by the 
string value assigned to variable, if any. However, there exists 
a special notation to allow conditional substitution depending 
upon whether the variable is set and/or not null. By definition, 
a variable is set if it has ever been assigned a value. The value 
of a variable can be the null string which may be assigned to a 
variable in any one of the following ways: 



A= 
bcd="" 
Ef_g='' 
set '' '' 

The first three of these examples assign the null string to each 
of the corresponding shell variables. The last example sets the 
first and second positional parameters to the null string and 
unsets all other positional parameters. 

The following conditional expressions depend upon whether a 
variable is set and not null. (Note that, in these expressions, 
variable refers to either a digit or a variable name.) 

${variable:-string} If variable is set and is non-null, then 
substitute the value $variable in place of this expression. 
Otherwise, replace the expression with string. Note that 
the value of variable is not changed by the evaluation of 
this expression. 

${variable:=string} If variable is set and is non-null, then 
substitute the value $variable in place of this expression. 
Otherwise, set variable to string, and then substitute the 
value $variable in place of this expression. Positional 
parameters may not be assigned values in this fashion. 

${variable:?string} If variable is set and is non-null, then 
substitute the value of variable for the expression. 
Otherwise, print a message of the form: 

variable: string 

and exit from the current shell. (If the shell is the 
login shell, it is not exited.) If string is omitted in this 
form, then the message 

variable: parameter null or not set 

is printed instead. 

${variable:+string} If variable is set and is non-null, then 
substitute string for this expression; otherwise, substitute 
the null string. Note that the value of variable is not 
altered by the evaluation of this expression. 

These expressions may also be used without the colon (:). In 
this case, the shell does not check whether variable is null or 
not. It only checks whether variable has ever been set. 

The two examples below illustrate the use of this facility: 

1. If PATH has ever been set and is not null, then keep its 
current value. Otherwise, set it to the string 
:/bin:/usr/bin. Note that one needs an explicit 
assignment to set PATH in this form: 

PATH=${PATH:-':/bin:/usr/bin'} 

2. If HOME is set and is not null, then change directory to 
it; otherwise, set it to /usr/gas and change directory to 
it. Note that HOME is automatically assigned a value in 
this case: 

cd ${HOME:='/usr/gas'} 
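
The forms can be compared side by side with throwaway variables. In this sketch the names v and e and the substituted strings are assumptions made for the illustration.

```shell
unset v
a=${v:-fallback}     # v is unset: "fallback" is substituted, v stays unset
b=${v:=assigned}     # v is unset: v is set to "assigned", then substituted
c=${v:+alternate}    # v is now non-null: "alternate" is substituted

e=                   # e is set, but null
d=${e-unset}         # no colon: only "set" is tested, so the null value is used
f=${e:-null}         # with colon: a null value also selects the default

echo "a=$a b=$b v=$v c=$c d=$d f=$f"
```

The last two lines show the effect of omitting the colon: a variable that is set to the null string counts as set.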



CONTROL COMMANDS 

The shell provides several commands that are useful in 
creating shell procedures. A few definitions are needed before 
explaining the commands. 



A simple command is defined as a sequence of nonblank 
arguments separated by blanks or tabs. The first argument 
usually specifies the name of the command to be executed. Any 
remaining arguments, with a few exceptions, are passed to the 
command. Input/output redirection arguments can appear in a 
simple command line and are passed to the shell, not to the 
command. 

A command is a simple command or any of the shell 
commands described below. A pipeline is a sequence of one or 
more commands separated by |. (For historical reasons, ^ is a 
synonym for | in this context.) The standard output of each 
command but the last in a pipeline is connected [by a pipe(2)] 
to the standard input of the next command. Each command in 
a pipeline is run separately. The shell waits for the last 
command to finish. If no exit status argument is specified, the 
exit status is that of the last command executed (an end-of-file 
will also cause the shell to exit). 

A command list is a sequence of one or more pipelines 
separated by ;, &, &&, or ||, and optionally terminated by ; 
or &. A semicolon (;) causes sequential execution of the 
previous pipeline (that is, the shell waits for the pipeline to 
finish before reading the next pipeline), while & causes 
asynchronous execution of the preceding pipeline. Both 
sequential and asynchronous execution are thus allowed. An 
asynchronous pipeline continues execution until it terminates 
voluntarily or until its processes are killed. 

More typical uses of & include off-line printing, background 
compilation, and generation of jobs to be sent to other 
computers. For example, typing 

nohup cc prog.c & 

allows one to continue working while the C compiler runs in the 
background. A command line ending with & is immune to 
interrupts and quits, but it is wise to make it immune to 
hang-ups as well. The nohup command is used for this 
purpose. Without nohup, if one hangs up while cc in the above 
example is still executing, cc will be killed and the output will 
disappear. 

The && and || operators, which are of equal precedence (higher 
than ; and & but lower than |), cause conditional execution of 
pipelines. In cmd1 || cmd2, cmd1 is executed and its exit status 
examined. Only if cmd1 fails (i.e., has a nonzero exit status) is 
cmd2 executed. This is thus a more terse notation for: 

cmd1 
if test $? != 0 
then 
    cmd2 
fi 

The && operator yields the complementary test: in cmd1 && 
cmd2, the second command is executed only if the first 
succeeds (has a zero exit status). In the sequence below, each 
command is executed in order until one fails: 

cmd1 && cmd2 && cmd3 && ... && cmdn 
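
A short sketch of both operators follows. It relies on the standard commands true and false, which return zero and nonzero exit status respectively; the variable names are assumptions made for the illustration.

```shell
unset after_success after_failure fallback

true  && after_success=yes     # runs: the first command succeeded
false && after_failure=yes     # skipped: the first command failed
false || fallback=yes          # runs: the first command failed

# A chain stops at the first failure; the final assignment never runs.
chain=
true && chain=one && false && chain=two

echo "${after_success:-no} ${after_failure:-no} ${fallback:-no} $chain"
```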

A simple command in a pipeline may be replaced by a command 
list enclosed in either parentheses or braces. The output of all 
the commands so enclosed is combined into one stream that 
becomes the input to the next command in the pipeline. The 
following line prints two separate documents: 



{ nroff -cm text1; nroff -cm text2; } | col 



Programming Constructs 

Several control flow commands are provided in the shell that 
are especially useful in programming. These are referred to as 
programming constructs and are described below. 



A command often used with programming constructs is the 
test(1) command. An example of the use of the test command 
is: 

test -f file 

This command returns zero exit status (true) if file exists and 
nonzero exit status otherwise. In general, test evaluates a 
predicate and returns the result as its exit status. Some of the 
more frequently used test arguments are given below [see 
test(1) and "Test" under "SPECIAL COMMANDS" for more 
information]. 

test s          true if the argument s is not the null string 
test -f file    true if file exists 
test -r file    true if file is readable 
test -w file    true if file is writable 
test -d file    true if file is a directory. 
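
The predicates above can be exercised against a scratch file and directory. This is a sketch only; the path names and the results variable are assumptions made for the illustration.

```shell
work=${TMPDIR:-/tmp}/testdemo.$$
mkdir "$work"
: > "$work/file"               # create an empty file

results=
test -f "$work/file" && results="$results file"
test -d "$work"      && results="$results dir"
test -d "$work/file" || results="$results notdir"
test ""              || results="$results empty"
test "text"          && results="$results nonempty"

rm -r "$work"
echo "$results"
```

Each word is appended to results only when the corresponding test has the expected outcome, so the final string records which predicates held.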



Control Flow — while 

The actions of the for loop and the case branch are 
determined by data available to the shell. A while or until 
loop and an if then else branch are also provided whose 
actions are determined by the exit status returned by 
commands. A while loop has the general form: 

while command-list1 
do 
    command-list2 
done 

The value tested by the while command is the exit status of 
the last simple command following while. Each time around 
the loop, command-list1 is executed. If a zero exit status is 
returned, then command-list2 is executed; otherwise, the loop 
stops. For example, 

while test $1 
do 
    shift 
done 

The shift command is a shell command that renames the 
positional parameters $2, $3, ... as $1, $2, ... and loses $1. 
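
A loop of this kind can process the arguments one at a time. In the sketch below the three arguments are forced with set, and the variable names seen and left are assumptions made for the illustration.

```shell
set alpha beta gamma     # force three positional parameters

seen=
while test $# != 0       # loop until no arguments remain
do
    seen="$seen<$1>"     # record the current first argument
    shift                # $2 becomes $1, and so on
done
left=$#

echo "$seen $left"
```

Testing $# rather than $1 lets the loop continue past an argument that is the null string.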

Another use for the while/until loop is to wait until some 
external event occurs and then run some commands. In an 
until loop, the termination condition is reversed. For example, 

until test -f file 
do 

sleep 300 
done 
commands 

will loop until file exists. Each time round the loop, it waits for 
5 minutes (300 seconds) before trying again. (Presumably, 
another process will eventually create the file.) 

A file print could be written to use while and test as follows: 

while test $# != 0 
do 
    echo "$1 being submitted" 
    lp -dprtd42 -c -ol2 -w -tuserl $1 
    shift 
done 
lpstat -oprtd42 

Control Flow — if 

Also available is a general conditional branch of the form: 

if command-list 
then 

command-list 
else 

command-list 
fi 

that tests the value returned by the last simple command 
following if. If a zero exit status is returned, the command-list 
following the then is executed. If a zero exit status is not 
returned, the command-list following the else is executed. 

The if command may be used with the test command to test 
for the existence of a file as in: 

if test -f file 
then 

process file 
else 

do something else 
fi 

A multiple test if command of the form: 

if ... 
then ... 
else 
    if ... 
    then ... 
    else 
        if ... 
        ... 
        fi 
    fi 
fi 

may be written using an extension of the if notation as: 

if ... 
then ... 
elif ... 
then ... 
elif ... 
... 
fi 

A file could be written to include the use of if and test as 
follows: 

if test $# = 0 
then 
    echo "enter a filename after $0" 
else 
    if [ ! -f $1 ] 
    then 
        echo "$1 does not exist" 
        echo "Enter a filename that exists"; exit 
    else 
        echo "$1 being submitted" 
        lp -dprtd42 -c -ol2 -w -tuserl $* 
        lpstat -oprtd42 
    fi 
fi 

The [...] is shorthand for test. The if [ ! -f $1 ] means: if the 
file $1 does not exist, then do this. 

The sequence 

if command1 
then 
    command2 
fi 

may be written 

command1 && command2 

Conversely, 

command1 || command2 

executes command2 only if command1 fails. In each case, 
the value returned is that of the last simple command executed. 



Control Flow — for 

A frequent use of shell procedures is to loop through the 
arguments ($1, $2, ...) executing commands once for each 
argument. An example of such a procedure is tel, which searches 
the file /usr/lib/telnos that contains lines of the form: 



fred mh0123 
bert mh0789 



The text of tel is: 

for i 
do 

grep $i /usr/lib/telnos 
done 

The command 

tel fred 

prints those lines in /usr/lib/telnos that contain the string 
"fred". 

The command 

tel fred bert 

prints those lines containing "fred" followed by those for 
"bert". 

The for loop notation is recognized by the shell and has the 
general form: 

for name in words 
do 

command-list 
done 

A command-list is a sequence of one or more simple 
commands separated or ended by a newline or a semicolon. A 
name is a shell variable that is set to words ... in turn each 
time the command-list following do is executed. If words ... 
is omitted, then the loop is executed once for each positional 
parameter; that is, in $* is assumed. Execution ends when 
there are no more words in the list. 



An example of the use of the for loop is the create command 
whose text is: 



for i do >$i; done 

The command 

create alpha beta 

ensures that two files alpha and beta exist and are 
empty. The notation >file may be used on its own to create or 
clear the contents of a file. Notice also that a semicolon (or 
newline) is required before done. 
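
Both forms of the loop, with and without a word list, can be sketched without touching any files. The words and the variable names listed and from_args are assumptions made for the illustration.

```shell
# With a word list, the variable takes each word in turn.
listed=
for i in ch1 ch2 chtoc
do
    listed="$listed[$i]"
done

# With no "in" clause, the loop runs once for each positional parameter.
set alpha beta
from_args=
for i
do
    from_args="$from_args[$i]"
done

echo "$listed $from_args"
```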

The for can also be used in a program. Assume a document is 
formatted and stored in chapters (files) that begin with the 
letters "ch" (ch1, ch2, ch3, and chtoc). A program can be 
written to send the document to the line printer. The program 
contains: 

for i in ch* 
do 
    lp -dprtd42 -c -ol2 -w -tuserl $i 
done 
lpstat -oprtd42 

This will send each chapter as a separate job. Notice that $i is 
used instead of $*. 



Control Flow — case 

A multiple way (choice) branch is provided for by the case 
notation. For example, 

case $# in 
1) cat >>$1 ;; 
2) cat >>$2 <$1 ;; 
*) echo 'usage: append [ from ] to' ;; 
esac 



is an append command. (Note the use of semicolons to delimit 
the cases.) When called with one argument as in 

append file 

$# is the string "1", and the standard input is appended 
(copied) onto the end of file using the cat(1) command. 

append file1 file2 

appends the contents of file1 onto file2. If the number of 
arguments supplied to append is other than 1 or 2, then a 
message is printed indicating proper usage. 

The general form of the case command is 

case word in 
pattern|pattern) command-list;; 
esac 

The shell attempts to match word with each pattern in the 
order that the patterns appear. If a match is found, the 
associated command-list is executed; and execution of the 
case is complete. Since * is the pattern that matches any 
string, it can be used for the default case. 



Caution: No check is made to ensure that only one 
pattern matches the case argument. 



The first match found defines the set of commands to be 
executed. In the example below, the commands following the 
second "*" will never be executed since the first "*" executes 
everything it receives. 



case $# in 
*) ... ;; 
*) ... ;; 
esac 



A program print can be used to send a document to different 
line printers. Assume there are two line printers named 
"prtd42" and "prtd43". Send a document to "prtd42" as 
follows: 

print 42 files 

Send a document to "prtd43" as follows: 

print 43 files 

The print program contains the following: 

case $1 in 
42) shift; lp -dprtd42 -c -ol2 -w -tuserl $*; lpstat -oprtd42;; 
43) shift; lp -dprtd43 -c -ol2 -w -tuserl $*; lpstat -oprtd43;; 
*)  echo "line printer does not exist";; 
esac 

Another example of the use of the case construction is to 
distinguish between different forms of an argument. The 
following example is a fragment of a cc(1) command. 

for i 
do 
    case $i in 
    -[ocs]) ... ;; 
    -*)     echo "unknown flag $i" ;; 
    *.c)    /lib/c0 $i ... ;; 
    *)      echo "unexpected argument $i" ;; 
    esac 
done 

To allow the same commands to be associated with more than 
one pattern, the case command provides for alternative 
patterns separated by a |. For example, 

case $i in 

-x|-y)... 
esac 

is equivalent to 

case $i in 

-[xy])... 
esac 

The usual quoting conventions apply so that 

case $i in 
\?) ... 

will match the character ?. 
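
The matching rules above (first match wins, alternatives separated by |, and quoting of pattern characters) can be sketched with a small shell function. The function name classify and the words it prints are assumptions made for the illustration.

```shell
# classify is a hypothetical helper that reports which pattern matched.
classify() {
    case $1 in
    -x|-y) echo flag ;;        # either alternative selects this branch
    \?)    echo question ;;    # quoted, so it matches only a literal ?
    *.c)   echo source ;;
    -*)    echo unknown ;;
    *)     echo other ;;       # default case
    esac
}

r1=`classify -y`
r2=`classify '?'`
r3=`classify -q.c`             # matches both *.c and -*; the first one wins
r4=`classify -z`
r5=`classify a`                # a single character, but not a literal ?

echo "$r1 $r2 $r3 $r4 $r5"
```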



SPECIAL COMMANDS 

There are several special commands that are internal to the 
shell (some of which have already been mentioned). These 
commands should be used in preference to other UNIX system 
commands whenever possible because they are faster and more 
efficient. The shell does not fork to execute these commands, 
so no additional processes are spawned. 

Many of these special commands were described in Chapter 10. 
These commands include: 

cd 

exec 

hash 

newgrp 

pwd 

set 

type 

ulimit 

umask 

unset. 



Descriptions of the remaining special commands follow. These
commands include:

break

continue 

echo 

eval 

exit 

export 

read 

readonly 

return 

shift 

test 

times 

trap 

wait. 



: (Colon) 

The : command is the null command. This command can be 
used to return a zero (true) exit status. 
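A brief sketch of two common uses of the null command follows; the directory tested is arbitrary and serves only as an illustration:

```shell
# ':' does nothing and returns a zero exit status.
:
status=$?            # status is 0

# It can stand in where the syntax requires a command,
# for example in an otherwise empty 'then' branch.
if test -d /
then
    :                # nothing to do when the directory exists
else
    echo "/ is missing"
fi
```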

. (Period) 

The . command has the form:

. file

This command reads and executes commands from file and
returns. The search path specified by PATH is used to find the
directory containing file. If the file command1 contained the
following

echo Today is:
date

then the command

. command1

returns 

Today is: 

Thu Sep 22 14:40:04 EDT 1984 

Any currently defined variable can be used in the shell 
procedure called. 
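The sharing of variables between the current shell and a file read with the . command can be sketched as follows. The file name setgreeting and the scratch directory are hypothetical; a path containing a slash is used so that the PATH search is not involved:

```shell
# setgreeting is a hypothetical file created only for this sketch.
tmpdir=${TMPDIR:-/tmp}/dotdemo.$$
mkdir "$tmpdir"
echo 'greeting="hello, $user_name"' > "$tmpdir/setgreeting"

user_name=jane                 # defined in the current shell, not exported
. "$tmpdir/setgreeting"        # read by the current shell, so user_name is visible
echo "$greeting"               # the assignment made by the file is still in effect

rm -r "$tmpdir"
```

Had setgreeting been run as an ordinary command instead, it would have executed in a child process and greeting would not be set in the invoking shell.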

break 

This command has the form: 

break [n] 

This command is used to exit from the enclosing for, until, or 
while loop. If n is specified, then exit n levels. An example of 
break is as follows: 

# This procedure is interactive; the 'break'
# command is used to allow
# the user to control data entry.
while true
do
    echo "Please enter data"
    read response
    case "$response" in
    "done") break ;;    # no more data
    *)      : ;;        # process the data here
    esac
done

continue

This command has the form: 

continue [n] 

This command causes the resumption of an enclosing for,
until, or while loop. If n is specified, then it resumes at the
n-th enclosing loop.
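The following sketch shows continue with and without the optional n. The variable names and loop values are arbitrary:

```shell
# Collect the odd numbers from 1 to 6; 'continue' skips the rest of the body.
odds=
for i in 1 2 3 4 5 6
do
    case $i in
    2|4|6) continue ;;          # resume the for loop with the next value
    esac
    odds="$odds $i"
done
echo "odd:$odds"

# 'continue 2' resumes the second enclosing loop, here the outer one.
pairs=
for a in x y
do
    for b in 1 2 3
    do
        test "$b" = 2 && continue 2     # abandon the inner loop entirely
        pairs="$pairs $a$b"
    done
done
echo "pairs:$pairs"
```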

echo 

The form of the echo command is: 

echo [arg ...] 

The echo command writes its arguments separated by blanks 
and terminated by a newline on the standard output. For 
instance, the input: 

echo Message to be printed.

returns

Message to be printed.

The following escapes can be used with echo:




\b      backspace
\c      print line without new-line
\f      form-feed
\n      new-line
\r      carriage return
\t      tab
\v      vertical tab
\\      backslash
\0n     the 8-bit character whose ASCII code is the octal number 0n
        (1, 2, or 3 digits, starting with a zero).

For example, 

echo "The current date is \c"
date

would return

The current date is Tue May 16 08:00:30 EDT 1984

eval 

Sometimes, one builds command lines inside a shell procedure. 
In this case, one might want to have the shell rescan the 
command line after all the initial substitutions and expansions 
are done. The special command eval is available for this 
purpose. The form of this command is: 

eval [arg ...] 

The eval command takes a command line as its argument and
simply rescans the line performing any variable or command
substitutions that are specified. Consider the following
situation:

command=who
output='| wc -l'
eval $command $output

This segment of code results in the pipeline who | wc -l being
executed.



The uses of eval can be nested. 
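As a further sketch of rescanning, eval can be used to reach a variable whose name is held in another variable. The variable names below are hypothetical:

```shell
# Indirect reference: 'name' holds the name of another variable.
color_sky=blue
name=color_sky

# First scan: $name becomes color_sky and \$ becomes $, giving value=$color_sky.
# eval then rescans that line and executes the assignment.
eval "value=\$$name"
echo "$value"

# A pipeline held in variables: without eval the | would be passed
# to echo as an ordinary argument instead of forming a pipe.
command='echo one two three'
output='| wc -w'
count=`eval $command $output`
```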



exit 

A shell program may be terminated at any place by using the 
exit command. The form of the exit command is: 



exit [n] 

The exit command can also be used to pass a return code (n) to
the shell. By convention, a 0 return code means true and a 1
to 255 return code means false. The return code can be found
by $?. For instance, if the executable procedure testexit
contained

exit 5

then

testexit

would execute testexit. The command:

echo $?

would return

5

export

The form of the export command is: 

export [name ...] 

The export command places the named variables in the 
environments of both the shell and all its future child 
processes. Normally, all variables are local to the shell 
program. Commands executed from within the shell program 
do not have access to the local variables. If a variable is 
exported, then the commands within the shell program will 
be able to access the variable. 

To export variables, the following command is used 

export variable1 variable2 ...

To obtain a list of variables exported, the following command 
is entered 

export 
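The difference between a local and an exported variable can be sketched by starting a child shell with sh -c. The variable names are hypothetical:

```shell
local_var=private
shared_var=public
export shared_var

# sh -c starts a child process; the single quotes keep the current
# shell from expanding the variables itself.
seen_local=`sh -c 'echo "${local_var:-unset}"'`
seen_shared=`sh -c 'echo "${shared_var:-unset}"'`
echo "child sees local_var=$seen_local shared_var=$seen_shared"
```

Only shared_var reaches the child; local_var remains known to the current shell alone.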



read 

A variable may also be set using the read command. The 
read command reads one line from the standard input of the 
shell procedure and puts that line in the variables which are 
its arguments. Leading spaces and tabs are stripped off. The 
general form of the command is:

read variable1 variable2 ...

The last variable gets what is left over. For example, if
testread contains the following

echo 'Please type your first and last name:\c' 

read first_name last_name 

echo Your name is ${first_name} ${last_name} 

then when the program is run the first line would be printed: 

Please type your first and last name: 

and would wait for the input. (The input would appear on the 
same line.) Assuming the name is Jane Doe, after the input, 
the following line would be printed: 

Your name is Jane Doe 
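The rule that the last variable gets what is left over can be sketched without a terminal by supplying the input line from an in-line document. The variable names are arbitrary:

```shell
# Three words are supplied for two variables;
# the last variable receives the remainder of the line.
read first_name rest <<!
Jane Q. Doe
!
echo "first=$first_name rest=$rest"
```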

readonly 

Variables can be made readonly. After becoming readonly, a 
variable cannot receive a new value. The general form of the 
command is 

readonly variable-name variable-name ... 

To print the names of variables that are readonly, enter 

readonly

return



The return command causes a function to exit with a specified 
return value. The form of the command is: 



return [n] 

where n is the desired return value. When n is omitted, the
return status is that of the last command executed.
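A minimal sketch of return follows. The function names is_dir and last_status are hypothetical, and the value 2 is an arbitrary choice of nonzero status:

```shell
# is_dir returns 0 for a directory and 2 otherwise.
is_dir() {
    if test -d "$1"
    then
        return 0
    fi
    return 2
}

is_dir /            && root_status=$?
is_dir /no/such/dir || missing_status=$?

# With n omitted, the status of the last command executed is returned.
last_status() {
    false
    return
}
last_status || omitted_status=$?
```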



shift 

The shift command [see sh(1)] reassigns the positional parameters.
Positional parameter $1 would receive the value of $2, $2 
would receive the value of $3, etc. Notice that $0 (the 
procedure name) is unchanged and that the number of 
positional parameters ($#) is decremented. 

If the executable program shifter contains the following: 

echo ${#} positional parameters 

echo ${*} 

echo Now shift 

shift 

echo ${#} positional parameters 

echo ${*} 

then the command: 

shifter first second third 

would result in

3 positional parameters
first second third
Now shift
2 positional parameters
second third



test 

The test(1) command evaluates the expression specified by its
arguments and, if the expression is true, returns a zero exit 
status. Otherwise, a nonzero (false) exit status is returned. The 
test command also returns a nonzero exit status if it has no 
arguments. Often it is convenient to use the test command as 
the first command in the command list following an if or a 
while. Shell variables used in test expressions should be 
enclosed in double quotes if there is any chance of their being 
null or not set. 

The square brackets ([]) may be used as an alias for test; e.g., 
[ expression ] has the same effect as test expression. 

The following is a partial list of the primaries that can be used 
to construct a conditional expression: 

-r file      true if the named file exists and is
             readable by the user.

-w file      true if the named file exists and is
             writable by the user.

-x file      true if the named file exists and is
             executable by the user.

-s file      true if the named file exists and has a size
             greater than zero.

-d file      true if the named file exists and is a
             directory.

-f file      true if the named file exists and is an
             ordinary file.

-p file      true if the named file exists and is a
             named pipe (fifo).

-z s1        true if the length of string "s1" is zero.

-n s1        true if the length of the string "s1" is
             nonzero.

-t fildes    true if the open file whose file descriptor
             number is fildes is associated with a
             terminal device. If fildes is not specified,
             file descriptor 1 is used by default.

s1 = s2      true if strings "s1" and "s2" are
             identical.

s1 != s2     true if strings "s1" and "s2" are not
             identical.

s1           true if "s1" is not the null string.

n1 -eq n2    true if the integers n1 and n2 are
             algebraically equal. Other algebraic
             comparisons are indicated by -ne, -gt,
             -ge, -lt, and -le.

These primaries may be combined with the following operators:

!            unary negation operator.

-a           binary logical and operator.

-o           binary logical or operator. The -o has
             lower precedence than -a.

( expr )     parentheses for grouping; they must be
             escaped to remove their significance to
             the shell. When parentheses are absent,
             the evaluation proceeds from left to right.

Note that all primaries, operators, file names, etc. are separate 
arguments to test. 
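The precedence of -a over -o, and the effect of escaped parentheses, can be sketched as follows. The helper in_range and the path /no/such/dir are hypothetical; the path is assumed not to exist:

```shell
# in_range is true when $1 lies between $2 and $3 inclusive.
in_range() {
    test "$1" -ge "$2" -a "$1" -le "$3"
}

name=""

# -a binds more tightly than -o, so this reads as
#   (name is empty) or ((/ is a directory) and (/no/such/dir is a directory))
if test -z "$name" -o -d / -a -d /no/such/dir
then verdict=true
else verdict=false
fi

# Escaped parentheses force the other grouping:
#   ((name is empty) or (/ is a directory)) and (/no/such/dir is a directory)
if test \( -z "$name" -o -d / \) -a -d /no/such/dir
then grouped=true
else grouped=false
fi
echo "ungrouped=$verdict grouped=$grouped"
```

Each operator, parenthesis, and operand is passed as a separate argument, and the variable is quoted because it is null.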

For example, consider the procedure nametest: 

if test -d $1
then echo $1 is a directory
elif test -f $1
then echo $1 is a file
else echo $1 does not exist
fi

If the ordinary file bucket existed, then the command

nametest bucket

would return

bucket is a file

times



The times command prints the accumulated user and system 
times for processes run from the shell. The times command is 
entered on a line by itself. For example, the command 

times 

returns 

0m3s 0m10s



trap 

A shell program may handle interrupts by using the trap 
command. The trap command interfaces with the underlying 
UNIX operating system mechanism for handling interrupts.

The UNIX operating system provides signals that tell a 
program when some unusual condition has occurred. These 
signals may be from the keyboard or from other programs. 

By default, if a program receives a signal, the program will 
terminate. However, these signals may be caught, the program 
suspended, the interrupt routine run, and the program 
restarted at the point it was suspended. Or these signals may 
be ignored. 

trap arg signal-list 

is the form of the trap command, where arg is a string to be 
interpreted as a command list and signal-list consists of one or 
more signal numbers [as described in signal(2)].

The following signals are used in the UNIX system:

01    hangup
02    interrupt
03    quit
04    illegal instruction
05    trace trap
06    IOT instruction
07    EMT instruction
08    floating point exception
09    kill
10    bus error
11    segmentation violation
12    bad argument to system call
13    write on a pipe with no one to read it
14    alarm clock
15    software termination signal
16    user defined signal 1
17    user defined signal 2
18    death of a child
19    power fail
20    window change
21    handset line status change.



The commands in arg are scanned at least once when the shell 
first encounters the trap command. Because of this, it is 
usually wise to use single rather than double quotes to 
surround these commands. The single quotes inhibit immediate 
command and variable substitution. This becomes important, 
for instance, when one wishes to remove temporary files and 
the names of those files have not yet been determined when the 
trap command is first read by the shell. The following 
procedure will print the name of the current directory on the 
file errdirect when it is interrupted, thus giving the user 
information as to how much of the job was done:

trap 'echo `pwd` >errdirect' 2 3 15
for i in /bin /usr/bin /usr/gas/bin
do
    cd $i
    # commands to be executed in directory $i here
done

The same procedure with double (rather than single) quotes
(trap "echo `pwd` >errdirect" 2 3 15) will, instead, print
the name of the directory from which the procedure was
executed.

Signal 11 (SEGMENTATION VIOLATION) may never be 
trapped because the shell itself needs to catch it to deal with 
memory allocation. Zero is not a UNIX system signal. Zero is 
effectively interpreted by the trap command as a signal 
generated by exiting from a shell (either via an exit command 
or by "falling through" the end of a procedure). If arg is not 
specified, then the action taken upon receipt of any of the 
signals in signal-list is reset to the default system action. If arg 
is an explicit null string ('' or ""), then the signals in signal-list
are ignored by the shell.
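The treatment of zero as "exit from the shell" can be sketched by running a short procedure in a child shell and capturing its output. The messages are arbitrary:

```shell
# The child shell started by sh -c stands in for a shell procedure.
# Its trap on 0 runs when the child falls through the end of its commands.
log=`sh -c 'trap "echo cleanup runs at exit" 0; echo working'`
echo "$log"
```

The line printed by the trap appears after the line printed by the body, because the trap commands run only as the child shell exits.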

The most frequent use of trap is to assure removal of 
temporary files upon termination of a procedure. The second 
example of "Predefined Special Variables" in subpart D, "Shell 
Variables," would be written more typically as follows: 

temp=$HOME/temp/$$
trap 'rm $temp; trap 0; exit' 0 1 2 3 15
ls > $temp
# commands, some of which use $temp, go here

In this example whenever signals 1 (HANGUP), 2 
(INTERRUPT), 3 (QUIT), or 15 (SOFTWARE TERMINATION) 
are received by the shell procedure or whenever the shell 
procedure is about to exit, the commands enclosed between the 
single quotes will be executed. The exit command must be
included or else the shell continues reading commands where it
left off when the signal was received. The trap 0 turns off the
original trap on exits from the shell so that the exit command
does not reactivate the execution of the trap commands.

Sometimes it is useful to take advantage of the fact that the 
shell continues reading commands after executing the trap 
commands. The following procedure takes each directory in the 
current directory, changes to it, prompts with its name, and 
executes commands typed at the terminal until an end-of-file 
(control-d) or an interrupt is received. An end-of-file causes 
the read command to return a nonzero exit status, thus 
terminating the while loop and restarting the cycle for the 
next directory. The entire procedure is terminated if 
interrupted when waiting for input; but during the execution of 
a command, an interrupt terminates only that command. 

dir=`pwd`
for i in *
do
    if test -d $dir/$i
    then
        cd $dir/$i
        while echo "$i:"
              trap exit 2
              read x
        do
            trap : 2    # ignore interrupts
            eval $x
        done
    fi
done

Several traps may be in effect at the same time. If multiple 
signals are received simultaneously, they are serviced in 
ascending order. To check what traps are currently set, type:

trap



It is important to understand some things about the way the 
shell implements the trap command in order not to be 
surprised. When a signal (other than 11) is received by the 
shell, it is passed on to whatever child processes are currently 
executing. When those (synchronous) processes terminate, 
normally or abnormally, the shell then polls any traps that 
happen to be set and executes the appropriate trap commands. 
This process is straightforward except in the case of traps set 
at the command (outermost or login) level. In this case, it is 
possible that no child process is running, so the shell waits for 
the termination of the first process spawned after the signal is 
received before it polls the traps. 

For internal commands, the shell normally polls traps on 
completion of the command. An exception to this rule is made 
for the read, hash, and echo commands. 



wait 

The wait command has the following form 

wait [n] 

With this command, the shell waits for the child process whose 
process number is n to terminate. The exit status of the wait 
command is that of the process waited on. If n is omitted or is 
not a child of the current shell, then all currently active 
processes are waited for and the return code of the wait 
command is zero. For example, the executable program
format:

while test "$1" != ""
do
    nroff $1 >>junk &
    shift
    wait $!
done
echo '***nroff complete***'

invokes the nroff formatter for each file specified and informs
the user when it is finished. If the files chapter1 and chapter2
required formatting, the entry:

format chapter1 chapter2

would format the two chapters and when they are finished
return

***nroff complete***
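The exit status rules for wait can be sketched with short-lived background jobs. The status value 3 is arbitrary:

```shell
# Start a background job that exits with status 3, then collect that status.
(exit 3) &
child=$!
wait $child || child_status=$?      # status of the process waited on

# With no argument, wait covers every background job and returns zero.
(exit 1) &
(exit 2) &
wait
all_status=$?
echo "child=$child_status all=$all_status"
```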

COMMAND GROUPING 

Commands may be grouped in two ways 

{ command-list ; } 

and 

( command-list ) 

The first form, command-list, is simply executed. The second 
form executes command-list as a separate process. If a list of 
commands is enclosed in a pair of parentheses, the list is 
executed as a subshell. The subshell inherits the environment 
of the main shell. The subshell does not change the 
environment of the main shell. For example,

(cd x; rm junk)

executes rm junk in the directory x without changing the 
current directory of the invoking shell. 

The commands 

cd x; rm junk 

have the same effect but leave the invoking shell in the 
directory x. 
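The same distinction applies to variables, as the following sketch suggests. The variable names are arbitrary:

```shell
count=1
{ count=2; }                 # braces: executed by the current shell, so the change persists
after_braces=$count

( count=3 )                  # parentheses: executed in a subshell, so the change is lost
after_parens=$count

# Either form lets a single pipe or redirection apply to the whole list.
lines=`{ echo first; echo second; } | wc -l`
echo "braces=$after_braces parens=$after_parens"
```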



A COMMAND'S ENVIRONMENT

All the variables (with their associated values) known to a 
command at the beginning of execution of that command 
constitute its environment. This environment includes 
variables that the command inherits from its parent process 
and variables specified as keyword parameters on the command 
line that invokes the command. 



The variables that a shell passes to its child processes are 
those that have been named as arguments to the export 
command. The export command places the named variables in 
the environments of both the shell and its future child 
processes. 

Keyword parameters are variable-value pairs that appear in the 
form of assignments, normally before the procedure name on a 
command line. Such variables are placed in the environment of 
the procedure being invoked. For example, 

# key_command 

echo $a $b

is a simple procedure that echoes the values of two variables.
If it is invoked as

a=key1 b=key2 key_command

then the output is

key1 key2

A procedure's keyword parameters are not included in the 
argument count $#. 
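Keyword parameters can be tried without creating a procedure file by letting sh -c play the part of key_command. This substitution is an assumption made only to keep the sketch self-contained:

```shell
a=original

# The assignments are placed in the environment of the one command that follows.
# The child prints the two values and then its argument count.
shown=`a=key1 b=key2 sh -c 'echo $a $b; echo $#'`
echo "$shown"

echo "a is still $a"         # the invoking shell's own variable is unchanged
```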

A procedure may access the value of any variable in its 
environment. However, if changes are made to the value of a 
variable, these changes are not reflected in the environment. 
The changes are local to the procedure in question. In order for 
these changes to be placed in the environment that the 
procedure passes to its child processes, the variable must be 
named as an argument to the export command within that 
procedure. To obtain a list of variables that have been made 
exportable from the current shell, type: 

export 

To get a list of name-value pairs in the current environment, 
type: 



env

DEBUGGING SHELL PROCEDURES

The shell provides two tracing mechanisms to help when
debugging shell procedures. The first is invoked within the
procedure as

set -v

(v for verbose) and causes lines of the procedure to be printed 
as they are read. It is useful to help isolate syntax errors. It 
may be invoked without changing the procedure by entering: 

sh -v proc ...

where proc is the name of the shell procedure. This flag may
be used with the -n flag to prevent execution of later
commands. (Note that typing "set -n" at a terminal will
render the terminal useless until an end-of-file is typed.)

The command: 

set -x

will produce an execution trace with flag -x. Following
parameter substitution, each command is printed as it is
executed. (Try the above at the terminal to see the effect it
has.) Both flags may be turned off by typing:

set -

and the current setting of the shell flags is available as $-.

Chapter 12
EXAMPLES OF SHELL PROCEDURES 

Some examples in this subpart are quite difficult for beginners. 
For ease of reference, the examples are arranged alphabetically 
by name, rather than by degree of difficulty. 



copypairs 

# usage: copypairs file1 file2 ...
# copy file1 to file2, file3 to file4, ...
while test "$2" != ""
do
    cp $1 $2
    shift; shift
done
if test "$1" != ""
then
    echo "$0: odd number of arguments"
fi



Note: This procedure illustrates the use of a while 
loop to process a list of positional parameters that are 
somehow related to one another. Here a while loop is 
much better than a for loop because you can adjust the 
positional parameters via shift to handle related 
arguments.

copyto

# usage: copyto dir file ...
# copy argument files to 'dir',
# making sure that at least
# two arguments exist and that 'dir'
# is a directory
if test $# -lt 2
then
    echo "$0: usage: copyto directory file ..."
elif test ! -d $1
then
    echo "$0: $1 is not a directory";
else
    dir=$1; shift
    for eachfile
    do
        cp $eachfile $dir
    done
fi



Note: This procedure uses an if command with two 
tests in order to screen out improper usage. The for 
loop at the end of the procedure loops over all of the 
arguments to copyto but the first. The original $1 is 
shifted off.

distinct



# usage: distinct
# reads standard input and reports
# list of alphanumeric strings
# that differ only in case,
# giving lower-case form of each
tr -cs '[A-Z][a-z][0-9]' '[\012*]' | sort -u |
tr '[A-Z]' '[a-z]' | sort | uniq -d



Note: This procedure is an example of the kind of
process that is created by the left-to-right construction
of a long pipeline. It may not be immediately obvious
how this works. [See tr(1), sort(1), and uniq(1) if you
are completely unfamiliar with these commands.] The
first tr translates all characters except letters and digits
into newline characters and then squeezes out repeated
newline characters. This leaves each string (in this case,
any contiguous sequence of letters and digits) on a
separate line. The sort command sorts the lines and
emits only one line from any sequence of one or more
repeated lines. The next tr converts everything to
lowercase so that identifiers differing only in case
become identical. The output is sorted again to bring
such duplicates together. The uniq -d prints (once)
only those lines that occur more than once, yielding the
desired list.



The process of building such a pipeline uses the fact that pipes 
and files can usually be interchanged. The two lines below are 
equivalent assuming that sufficient disk space is available: 

cmd1 | cmd2 | cmd3
cmd1 >temp1; cmd2 <temp1 >temp2; cmd3 <temp2; rm temp[12]

Starting with a file of test data on the standard input and
working from left to right, each command is executed taking its
input from the previous file and putting its output in the next
file. The final output is then examined to make sure that it 
contains the expected result. The goal is to create a series of 
transformations that will convert the input to the desired 
output. As an exercise, try to mimic distinct with such a 
step-by-step process using a file of test data containing: 

ABC:DEF/DEF
ABC1 ABC
Abc abc
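One possible way of working through this exercise is sketched below. The scratch directory and the file names step1 through step4 are arbitrary, the character ranges are written in the form accepted by current versions of tr, and LC_ALL=C is set on the assumption that plain ASCII ordering is wanted:

```shell
LC_ALL=C; export LC_ALL
work=${TMPDIR:-/tmp}/distinct.$$
mkdir "$work"
cat > "$work/input" <<!
ABC:DEF/DEF
ABC1 ABC
Abc abc
!
tr -cs 'A-Za-z0-9' '\012' < "$work/input" > "$work/step1"   # one string per line
sort -u        < "$work/step1" > "$work/step2"              # drop exact duplicates
tr 'A-Z' 'a-z' < "$work/step2" > "$work/step3"              # fold to lower case
sort           < "$work/step3" > "$work/step4"              # bring case variants together
result=`uniq -d "$work/step4"`                              # lines seen more than once
rm -r "$work"
echo "$result"
```

Each intermediate file can be examined with cat before the next step is added, which is the point of building the pipeline one stage at a time.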



Although pipelines can give a concise notation for complex 
processes, exercise some restraint lest you succumb to the 
"one-line syndrome" sometimes found among users of especially 
concise languages. This syndrome often yields 
incomprehensible code. 



draft 

# usage: draft file(s) 

# prints the draft (-rC3) of a document on a DASI 450 

# terminal in 12-pitch using memorandum macros (MM). 
nroff -rC3 -T450-12 -cm $* 



Note: Users often write this kind of procedure for 
convenience in dealing with commands that require the 
use of many distinct flags. These flags cannot be given 
default values that are reasonable for all (or even most) 
users.

edfind

# usage: edfind file arg 

# find the last occurrence in 'file' of a line whose 

# beginning matches 'arg', then print 3 lines (the one 

# before, the line itself, and the one after) 
ed - $1 <<!
H
?^$2?;-,+p
!



Note: This procedure illustrates the practice of using 
editor (ed) inline input scripts into which the shell can 
substitute the values of variables. It is a good idea to 
turn on the H option of ed when embedding an ed 
script in a shell procedure [see ed(1)].



edlast 

# usage: edlast file 

# prints the last line of file, then deletes that line 

ed - $1 <<-\eof    # no variable substitutions in "ed" script
H
$p
$d
w
q
eof
echo Done.



Note: This procedure contains an in-line input 
document or script; it also illustrates the effect of 
inhibiting substitution by escaping a character in the 
eofstring (here, eof) of the input redirection. If this had 
not been done, $p and $d would have been treated as
shell variables.



fsplit 

# usage: fsplit file1 file2
# read standard input and divide it into three parts:
# append any line containing at least one letter
# to file1, any line containing at least one digit
# but no letters to file2, and throw the rest away
total=0 lost=0
while read next
do
    total="`expr $total + 1`"
    case "$next" in
    *[A-Za-z]*)
        echo "$next" >> $1 ;;
    *[0-9]*)
        echo "$next" >> $2 ;;
    *)
        lost="`expr $lost + 1`"
    esac
done
echo "$total lines read, $lost thrown away"



Note: In this procedure, each iteration of the while 
loop reads a line from the input and analyzes it. The 
loop terminates only when read encounters an 
end-of-file. 



Do not use the shell to read a line at a time unless you
must; it can be grotesquely slow.

initvars



# usage: . initvars
# use carriage return to indicate "no change"
echo "initializations? \c"
read response
if test "$response" = y
then
    echo "PS1=\c"; read temp
    PS1=${temp:-$PS1}
    echo "PS2=\c"; read temp
    PS2=${temp:-$PS2}
    echo "PATH=\c"; read temp
    PATH=${temp:-$PATH}
    echo "TERM=\c"; read temp
    TERM=${temp:-$TERM}
fi



Note: This procedure would be invoked by a user at the
terminal or as part of a file. The assignments are
effective even when the procedure is finished because the
dot command is used to invoke it. To better understand
the dot command, invoke initvars as indicated above
and check the values of PS1, PS2, PATH, and TERM;
then make initvars executable, type initvars, assign
different values to these variables, and check again
the values of these shell variables after initvars
terminates. It is assumed that PS1, PS2, PATH, and
TERM have been exported, presumably by your
.profile.

merge

# usage: merge src1 src2 [ dest ]
# merge two files, every other line.
# the first argument starts off the merge,
# excess lines of the longer
# file are appended to
# the end of the resultant file
exec 4<$1 5<$2
dest=${3-$1.m}    # default destination file is named $1.m
while true
do
    # alternate reading from the files;
    # 'more' represents the file descriptor
    # of the longer file
    line <&4 >>$dest || { more=5; break ;}
    line <&5 >>$dest || { more=4; break ;}
done
# delete the last line of destination
# file, because it is blank,
ed - $dest <<\eof
H
$d
w
q
eof
while line <&$more >> $dest
do :; done    # read the remainder of the longer
              # file; the body of the 'while' loop
              # does nothing; the work of the loop
              # is done in the command list following
              # 'while'



Note: This procedure illustrates a technique for 
reading sequential lines from a file or files without 
creating any subshells to do so. When the file descriptor 
is used to access a file, the effect is that of opening the 
file and moving a file pointer along until the end of the 
file is read. If the input redirections used src1 and
src2 explicitly rather than the associated file
descriptors, this procedure would never terminate 
because the first line of each file would be read over and 
over again. 
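The difference between reading through an open file descriptor and redirecting from the file name each time can be sketched as follows. The built-in read is used here in place of line, and the scratch file is hypothetical:

```shell
work=${TMPDIR:-/tmp}/fddemo.$$
mkdir "$work"
cat > "$work/src" <<!
line one
line two
!

exec 4< "$work/src"          # open once; descriptor 4 keeps its position in the file
read first  <&4
read second <&4              # continues where the previous read stopped
exec 4<&-                    # close the descriptor

read again < "$work/src"     # reopens the file, so the first line is read once more
rm -r "$work"
echo "first=$first second=$second again=$again"
```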



mkfiles 

# usage: mkfiles pref [ quantity ]
# makes 'quantity' (default = 5) files,
# named pref1, pref2, ...
quantity=${2-5}
i=1
while test "$i" -le "$quantity"
do
    > $1$i
    i="`expr $i + 1`"
done



Note: This procedure uses input/output redirection to 
create zero-length files. The expr command is used for 
counting iterations of the while loop. Compare this 
procedure with procedure null below. 
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mmt 



if test "$#" = 0; then cat <<\!
Usage: "mmt [ options ] files" where "options" are:
-a      => output to terminal
-e      => preprocess input with eqn
-t      => preprocess input with tbl
-Tst    => output to STARE phototypesetter by Honeywell
-T4014  => output to 4014 manufactured by Tektronix
-Tvp    => output to printer manufactured by Versatec
-       => use instead of "files" when mmt used inside a pipeline.
Other options as required by TROFF and the MM macros.
!
exit 1
fi
PATH='/bin:/usr/bin'; O='-g'; o='| gcat -ph';
# Assumes typesetter is accessed via gcat(1)
# If typesetter is on-line, use O=''; o=''
while test -n "$1" -a ! -r "$1"
do
    case "$1" in
    -a)     O='-a'; o='';;
    -Tst)   O='-g'; o='| gcat -st';;
    # Above line for STARE only
    -T4014) O='-t'; o='| tc';;
    -Tvp)   O='-t'; o='| vpr -t';;
    -e)     e='eqn';;
    -t)     f='tbl';;
    -)      break;;
    *)      a="$a $1";;
    esac
    shift
done
if test -z "$1"
then
    echo 'mmt: no input file'
    exit 1
fi
if test "$O" = '-g'
then
    x="-f$1"
fi
d="$*"
if test "$d" = '-'
then
    shift
    x=''
    d=''
fi
if test -n "$f"
then
    f="tbl $* | "
    d=''
fi
if test -n "$e"
then
    if test -n "$f"
    then e='eqn | '
    else e="eqn $* | "
        d=''
    fi
fi
eval "$f $e troff $O -cm $a $d $o $x"; exit



Note: This is a slightly simplified version of an actual 
UNIX system command. It uses many of the features 
available in the shell. If you can follow through it 
without getting lost, you have a good understanding of 
shell programming. Pay particular attention to the 
process of building a command line from shell variables 
and then using eval to execute it.

null

# usage: null file 

# create each of the named files 

# as an empty file 
for eachfile 

do 

> $eachfile 
done 



Note: This procedure uses the fact that output 
redirection creates the (empty) output file if that file 
does not already exist. Compare this procedure with 
procedure mkfiles above. 



phone 

# usage: phone initials 

# prints the phone number(s) of person 

# with given initials 
echo 'inits ext home'
grep "^$1" <<\!
abc 1234 999-2345
def 2234 583-2245
ghi 3342 988-1010
xyz 4567 555-1234
!



Note: This procedure is an example of using an inline
input document or script to maintain a small data base.

writemail

# usage: writemail message user 

# if user is logged in, write message on terminal; 

# otherwise, mail it to user 

echo "$1" | { write "$2" || mail "$2" ;}



Note: This procedure illustrates command grouping. 
The message specified by $1 is piped to the write 
command and, if write fails, to the mail command.

Chapter 13

A PROGRAM FOR MAINTAINING 
COMPUTER PROGRAMS— "make" 



GENERAL 

In a programming project, a common practice is to divide large 
programs into smaller pieces that are more manageable. The 
pieces may require several different treatments such as being 
processed by a macro processor or sophisticated program 
generators (e.g., Yacc or Lex). The project continues to 
become more complex as the output of these generators is 
compiled with special options and with certain definitions and 
declarations. A sequence of code transformations develops 
which is difficult to remember. The resulting code may need 
further transformation by loading the code with certain 
libraries under control of special options. Related maintenance
activities, such as running test scripts and installing validated
modules, complicate the process further. Another activity that
complicates program development is a long editing session. A
programmer may lose track of the files changed and the object 
modules still valid, especially when a change to a declaration 
can make a dozen other files obsolete. The programmer must 
also remember to compile a routine that has been changed or 
that uses changed declarations. 

The "make" command is a software tool that maintains, 
updates, and regenerates groups of computer programs. 

A programmer can easily forget 

• Files that are dependent upon other files. 

• Files that were modified recently.

• Files that need to be reprocessed or recompiled after a
change in the source. 

• The exact sequence of operations needed to make and 
exercise a new version of the program. 

The many activities of program development and maintenance 
are made simpler by the make program. 

The make program provides a method for maintaining
up-to-date versions of programs that result from many operations on
a number of files. The make program keeps track of the 
sequence of commands that create certain files and the list of 
files that require other files to be current before the operations 
can be done. Whenever a change is made in any part of a 
program, the make command creates the proper files simply, 
correctly, and with a minimum amount of effort. The make 
program also provides a simple macro substitution facility and 
the ability to encapsulate commands in a single file for 
convenient administration. 



The basic operation of make is to 

• Find the name of the needed target file in the description. 

• Ensure that all of the files on which it depends exist and
are up to date.

• Create the target file if it has not been modified since its 
generators were modified. 

The description file really defines the graph of dependencies.
The make program determines the necessary work by 
performing a depth-first search of the graph of dependencies. 

If the information on interfile dependencies and command 
sequences is stored in a file (makefile or Makefile), the simple 
command

    make

is frequently sufficient to update the interesting files regardless
of the number edited since the last make. In most cases, the 
description file is easy to write and changes infrequently. It is 
usually easier to type the make command than to issue even 
one of the needed operations, so the typical cycle of program 
development operations becomes 

think - edit - make - test . . . 

The make program is most useful for medium-sized 
programming projects. The make program does not solve the 
problems of maintaining multiple source versions or of 
describing huge programs. 



BASIC FEATURES 

The basic operation of make is to update a target file by 
ensuring that all of the files on which the target file depends 
exist and are up to date. The target file is created if it has not 
been modified since the dependents were modified. The make 
program does a depth-first search of the graph of dependencies. 
The operation of the command depends on the ability to find 
the date and time that a file was last modified. 



To illustrate, consider a simple example in which a program
named prog is made by compiling and loading three C language
files x.c, y.c, and z.c with the ld library. By convention, the
output of the C language compilations will be found in files
named x.o, y.o, and z.o. Assume that the files x.c and y.c share
some declarations in a file named defs, but that z.c does not.
That is, x.c and y.c have the line

    #include "defs"

The following text describes the relationships and operations:

    prog : x.o y.o z.o
            cc x.o y.o z.o -lld -o prog
    x.o y.o : defs

If this information were stored in a file named makefile, the 
command 



make 



would perform the operations needed to recreate prog after any 
changes had been made to any of the four source files x.c, y.c, 
z.c, or defs. 



The make program operates using the following three sources 
of information: 



• A user-supplied description file 

• File names and "last-modified" times from the file system 

• Built-in rules to bridge some of the gaps. 

In the example, the first line states that prog depends on three 
".o" files. Once these object files are current, the second line 
describes how to load them to create prog. The third line states 
that x.o and y.o depend on the file defs. From the file system, 
make discovers that there are three ".c" files corresponding to 
the needed ".o" files and uses built-in information on how to 
generate an object from a source file (i.e., issue a "cc -c" 
command). 

By not taking advantage of make's innate knowledge, the
following longer description file results.

    prog : x.o y.o z.o
            cc x.o y.o z.o -lld -o prog
    x.o : x.c defs
            cc -c x.c
    y.o : y.c defs
            cc -c y.c
    z.o : z.c
            cc -c z.c

If none of the source or object files have changed since the last 
time prog was made, all of the files are current, and the 
command 



make 



announces this fact and stops. If, however, the defs file has 
been edited, x.c and y.c (but not z.c) are recompiled; and then
prog is created from the new ".o" files. If only the file y.c had
changed, only it is recompiled; but it is still necessary to reload 
prog. If no target name is given on the make command line, 
the first target mentioned in the description is created; 
otherwise, the specified targets are made. The command 

make x.o 

would recompile x.o if x.c or defs had changed. 
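
The update behavior just described can be sketched in the shell. This is a minimal illustration, not make's implementation: the function names deps_of and update are hypothetical, the dependency table is hard-coded for the prog example instead of being read from a description file, and touch stands in for the real cc commands.

```shell
# Hypothetical dependency table for the prog example. A real make
# reads this from the description file and its built-in rules.
deps_of() {
    case $1 in
        prog) echo "x.o y.o z.o" ;;
        x.o)  echo "x.c defs" ;;
        y.o)  echo "y.c defs" ;;
        z.o)  echo "z.c" ;;
    esac
}

# update TARGET: bring TARGET up to date, dependents first.
update() {
    # Depth first: visit every dependent before judging the target.
    for dep in $(deps_of "$1"); do
        update "$dep"
    done

    # Source files have no dependents and are never remade here.
    [ -n "$(deps_of "$1")" ] || return 0

    stale=no
    [ -e "$1" ] || stale=yes          # a missing target is stale
    for dep in $(deps_of "$1"); do
        # find prints the dependent only if it is newer than the target
        if [ -n "$(find "$dep" -prune -newer "$1" 2>/dev/null)" ]; then
            stale=yes
        fi
    done

    if [ "$stale" = yes ]; then
        echo "remake $1"
        touch "$1"                    # stand-in for the real commands
    fi
}
```

Running update prog in a directory that holds only x.c, y.c, z.c, and defs reports all four targets. After defs alone is modified, it reports x.o, y.o, and prog but not z.o.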

A method, often useful to programmers, is to include rules with 
mnemonic names and commands that do not actually produce a 
file with that name. These entries can take advantage of 
make's ability to generate files and substitute macros. Thus, 
an entry "save" might be included to copy a certain set of files, 
or an entry "cleanup" might be used to throw away unneeded 
intermediate files. 



If the file exists after the commands are executed, the file's 
time of last modification is used in further decisions. If the file
does not exist after the commands are executed, the current
time is used in making further decisions. 

You may maintain a zero-length file purely to keep track of the 
time at which certain actions were performed. This technique 
is useful for maintaining remote archives and listings. 

A simple macro mechanism for substituting in dependency lines 
and command strings is used by the make program. Macros 
are defined by command arguments or description file lines 
with embedded equal signs. A macro is invoked by preceding 
the name by a dollar sign. Macro names longer than one 
character must be parenthesized. The name of the macro is 
either the single character after the dollar sign or a name 
inside parentheses. The following are valid macro invocations: 

    $(CFLAGS)
    $2
    $(xy)
    $Z
    $(Z)

The last two invocations are identical. A $$ is a dollar sign. 

The $*, $@, $?, and $< are four special macros which change 
values during the execution of the command. (These four 
macros are described in the part "DESCRIPTION FILES AND 
SUBSTITUTIONS".) The following fragment shows assignment 
and use of some macros: 

    OBJECTS = x.o y.o z.o
    LIBES = -lld
    prog: $(OBJECTS)
            cc $(OBJECTS) $(LIBES) -o prog

The command

    make "LIBES= -ll -lld"

loads the three objects with the Lex (-ll) library since macro
definitions on the command line override definitions in the
description. Arguments with embedded blanks must be quoted
in UNIX software commands.

As an example of the use of make, the description file used to 
maintain the make command is given. The code for make is 
spread over a number of C language source files and a Yacc 
grammar. The description file contains: 

    # Description file for the Make command

    P = lp
    FILES = Makefile version.c defs main.c doname.c \
            misc.c files.c dosys.c gram.y lex.c gcos.c
    OBJECTS = version.o main.o doname.o misc.o files.o \
            dosys.o gram.o
    LIBES = -lld
    LINT = lint -p
    CFLAGS = -O

    make: $(OBJECTS)
            cc $(CFLAGS) $(OBJECTS) $(LIBES) -o make
            @size make

    $(OBJECTS): defs
    gram.o: lex.c

    cleanup:
            -rm *.o gram.c
            -du

    install:
            @size make /usr/bin/make
            cp make /usr/bin/make ; rm make

    print: $(FILES) # print recently changed files
            pr $? | $P
            touch print

    test:
            make -dp | grep -v TIME >1zap
            /usr/bin/make -dp | grep -v TIME >2zap
            diff 1zap 2zap
            rm 1zap 2zap

    lint : dosys.c doname.c files.c main.c misc.c version.c \
            gram.c
            $(LINT) dosys.c doname.c files.c main.c misc.c \
            version.c gram.c

    arch:
            ar uv /sys/source/s2/make.a $(FILES)



The make program usually prints out each command before
issuing it.

The following output results from typing the simple command 
make in a directory containing only the source and description 
files: 



    cc -O -c version.c
    cc -O -c main.c
    cc -O -c doname.c
    cc -O -c misc.c
    cc -O -c files.c
    cc -O -c dosys.c
    yacc gram.y
    mv y.tab.c gram.c
    cc -O -c gram.c
    cc version.o main.o doname.o misc.o files.o dosys.o
       gram.o -lld -o make
    13188+3348+3044 = 19580b = 046174b

Although none of the source files or grammars were mentioned
by name in the description file, make found them using its
suffix rules and issued the needed commands. The string of
digits results from the size make command. The @ sign on the
size command in the description file suppressed the printing
of the command line itself, so only the sizes are written.

The last few entries in the description file are useful 
maintenance sequences. The "print" entry prints only the files 
changed since the last make print command. A zero-length 
file print is maintained to keep track of the time of the 
printing. The $? macro in the command line then picks up only 
the names of the files changed since print was touched. The 
printed output can be sent to a different printer or to a file by 
changing the definition of the P macro as follows: 

    make print "P= cat >zap"
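
The effect of the zero-length print file can be imitated in the shell. The sketch below is only an illustration (changed_since is a hypothetical name, not part of make): it lists the files modified after a stamp file, which is roughly the list that $? holds when the print entry runs. It assumes that a missing stamp file makes every file count as changed.

```shell
# changed_since STAMP FILE...: print each FILE modified after STAMP.
changed_since() {
    stamp=$1
    shift
    for file do
        if [ ! -e "$stamp" ]; then
            # No stamp yet: every file counts as changed.
            printf '%s\n' "$file"
        elif [ -n "$(find "$file" -prune -newer "$stamp")" ]; then
            # find prints the name only if the file is newer than the stamp
            printf '%s\n' "$file"
        fi
    done
}
```

A command such as changed_since print main.c misc.c, followed by touch print, gives the same bookkeeping as the print entry: each run reports only what changed since the previous run.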



DESCRIPTION FILES AND 
SUBSTITUTIONS 

A description file contains the following information: 



Comments 

The comment convention is that a sharp (#) and all 
characters on the same line after a sharp are ignored. 
Blank lines and lines beginning with a sharp (#) are totally 
ignored. If a noncomment line is too long, the line can be 
continued by using a backslash. If the last character of a 
line is a backslash, then the backslash, the new line, and all 
following blanks and tabs are replaced by a single blank. 

Macro definitions 

A macro definition is a line containing an equal sign not
preceded by a colon or a tab. The name (string of letters
and digits) to the left of the equal sign (trailing blanks and
tabs are stripped) is assigned the string of characters
following the equal sign (leading blanks and tabs are
stripped). The following are valid macro definitions:

    2 = xyz
    abc = -ll -ly -lld
    LIBES =

The last definition assigns LIBES the null string. A macro 
that is never explicitly defined has the null string as the 
macro's value. 

Macro definitions may also appear on the make command
line. Other lines give information about target files.
The general form of an entry is

    target1 [target2 ...] :[:] [dependent1 ...] [; commands] [# ...]
    [(tab) commands] [# ...]



Items inside brackets may be omitted and targets and 
dependents are strings of letters, digits, periods, and 
slashes. Shell metacharacters such as "*" and "?" are 
expanded. Commands may appear either after a semicolon 
on a dependency line or on lines beginning with a tab 
immediately following a dependency line. A command is
any string of characters not including a new line or a
sharp (#), except when the sharp is in quotes.

Dependency information

A dependency line may have either a single or a double 
colon. A target name may appear on more than one 
dependency line, but all of those lines must be of the same 
(single or double colon) type. For the usual single-colon 
case, a command sequence may be associated with at most 
one dependency line. If the target is out of date with any
of the dependents on any of the lines and a command
sequence is specified (even a null one following a semicolon 
or tab), it is executed; otherwise, a default creation rule 
may be invoked. In the double-colon case, a command 
sequence may be associated with each dependency line; if 
the target is out of date with any of the files on a 
particular line, the associated commands are executed. A 
built-in rule may also be executed. This detailed form is of 
particular value in updating archive-type files. 

Executable commands 

If a target must be created, the sequence of commands is 
executed. Normally, each command line is printed and 
then passed to a separate invocation of the shell after 
substituting for macros. The printing is suppressed in the 
silent mode or if the command line begins with an @ sign. 
Make normally stops if any command signals an error by 
returning a nonzero error code. Errors are ignored if the
-i flag has been specified on the make command line, if
the fake target name ".IGNORE" appears in the 
description file, or if the command string in the description 
file begins with a hyphen. Some UNIX software commands 
return meaningless status. Because each command line is 
passed to a separate invocation of the shell, care must be 
taken with certain commands (e.g., cd and shell control 
commands) that have meaning only within a single shell 
process. These results are forgotten before the next line is 
executed. 
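
The consequence of passing each command line to a separate shell can be seen with a small sketch. The function run_lines is a hypothetical stand-in for the way make issues command lines; macro substitution and error handling are omitted.

```shell
# run_lines LINE...: run each LINE in its own invocation of the shell.
run_lines() {
    for line do
        sh -c "$line"
    done
}

# The cd is forgotten before pwd runs:
#     run_lines 'cd /' 'pwd'      prints the starting directory
# Joining the commands on one line keeps them in one shell:
#     run_lines 'cd /; pwd'       prints /
```

For the same reason, a cd in a description file is normally written on the same command line as the commands that depend on it, separated by semicolons.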

Before issuing any command, certain internally maintained 
macros are set. The $@ macro is set to the full target name 
of the current target. The $@ macro is evaluated only for 
explicitly named dependencies. The $? macro is set to the 
string of names that were found to be younger than the 
target. The $? macro is evaluated when explicit rules from 
the makefile are evaluated. If the command was generated 
by an implicit rule, the $< macro is the name of the related 
file that caused the action; and the $* macro is the prefix 
shared by the current and the dependent file names. If a 
file must be made but there are no explicit commands or
relevant built-in rules, the commands associated with the
name ".DEFAULT" are used. If there is no such name, 
make prints a message and stops. 



EXTENSIONS OF $*, $@, AND $< 

The internally generated macros $*, $@, and $< are useful 
generic terms for current targets and out-of-date relatives. To 
this list has been added the following related macros: $(@D), 
$(@F), $(*D), $(*F), $(<D), and $(<F). The "D" refers to the 
directory part of the single letter macro. The "F" refers to the 
file name part of the single letter macro. These additions are 
useful when building hierarchical makefiles. They allow access 
to directory names for purposes of using the cd command of 
the shell. Thus, a shell command can be 

    cd $(<D); $(MAKE) $(<F)
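
The D and F forms split a name much as the dirname and basename utilities do. The following shell analogy uses a hypothetical dependent name, src/lib/x.c, in place of $<. How make treats a name with no directory part is not shown here.

```shell
# Hypothetical value of $< chosen for illustration.
dependent=src/lib/x.c

dir_part=$(dirname "$dependent")      # corresponds to $(<D)
file_part=$(basename "$dependent")    # corresponds to $(<F)

# Show the command line that the two parts would produce.
echo "cd $dir_part; make $file_part"
```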



The following command forces a complete rebuild of the 
operating system: 

FRC=FRC make -f 70.mk 

where the current directory is ucb. The FRC is a convention for
forcing make to completely rebuild a target starting from
scratch.



OUTPUT TRANSLATIONS 

Macros in shell commands can now be translated when 
evaluated. The form is as follows: 



    $(macro:string1=string2)

The meaning of $(macro) is evaluated. For each appearance of
string1 in the evaluated macro, string2 is substituted. The
meaning of finding string1 in $(macro) is that the evaluated
$(macro) is considered as a bunch of strings each delimited by
white space (blanks or tabs). Thus, the occurrence of string1 in
$(macro) means that a regular expression of the following
form has been found:

    .*<string1>[TAB|BLANK]

This particular form was chosen because make usually 
concerns itself with suffixes. A more general regular 
expression match could be implemented if the need arises. The 
usefulness of this type of translation occurs when maintaining 
archive libraries. Now, all that is necessary is to accumulate 
the out-of-date members and write a shell script which can 
handle all the C language programs (i.e., those files ending in 
".c"). Thus, the following fragment optimizes the executions of 
make for maintaining an archive library: 

    $(LIB): $(LIB)(a.o) $(LIB)(b.o) $(LIB)(c.o)
            $(CC) -c $(CFLAGS) $(?:.o=.c)
            ar rv $(LIB) $?
            rm $?

A dependency of the preceding form is necessary for each of the
different types of source files (suffixes) which define the
archive library. These translations are added in an effort to 
make more general use of the wealth of information which 
make generates. 
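
The translation $(macro:string1=string2) can be approximated in the shell. In this sketch translate is a hypothetical function name. Following the regular expression given above, it assumes that string1 is replaced only where it ends a blank-delimited word.

```shell
# translate STRING1 STRING2 WORD...: replace a trailing STRING1 in
# each WORD with STRING2 and print the words on one line.
translate() {
    from=$1
    to=$2
    shift 2
    out=
    for word do
        case $word in
            *"$from") word=${word%"$from"}$to ;;   # suffix matched
        esac
        out="$out${out:+ }$word"
    done
    printf '%s\n' "$out"
}

translate .o .c a.o b.o README
```

With the arguments shown, the words a.o and b.o become a.c and b.c, while README is passed through unchanged because it does not end in ".o".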
COMMAND USAGE

The make command takes macro definitions, flags, description 
file names, and target file names as arguments in the form: 

make [ flags ] [ macro definitions ] [ targets ] 

The following summary of command operations explains how 
these arguments are interpreted. 

First, all macro definition arguments (arguments with 
embedded equal signs) are analyzed and the assignments made. 
Command-line macros override corresponding definitions found 
in the description files. Next, the flag arguments are examined. 
The permissible flags are as follows: 

-i          Ignore error codes returned by invoked
            commands. This mode is entered if the
            fake target name ".IGNORE" appears in
            the description file.

-s          Silent mode. Do not print command lines
            before executing. This mode is also
            entered if the fake target name
            ".SILENT" appears in the description file.

-r          Do not use the built-in rules.

-n          No execute mode. Print commands, but do
            not execute them. Even lines beginning
            with an "@" sign are printed.

-t          Touch the target files (causing them to be
            up to date) rather than issue the usual
            commands.

-q          Question. The make command returns a
            zero or nonzero status code depending on
            whether the target file is or is not up to
            date.

-p          Print out the complete set of macro
            definitions and target descriptions.

-m          Print a memory map showing text, data,
            and stack. This option is a no-operation
            on systems without the getu system call.

-b          Compatibility mode for old makefiles.

-k          Abandon work on the current entry but
            continue on other branches that do not
            depend on the current entry.

.DEFAULT    If a file must be made but there are no
            explicit commands or relevant built-in
            rules, the commands associated with the
            name .DEFAULT are used if it exists.

-e          Environment variables override
            assignments within makefiles.

.PRECIOUS   Dependents on this target are not
            removed when quit or interrupt is pressed.

-d          Debug mode. Print out detailed
            information on files and times examined.

-f          Description file name. The next argument
            is assumed to be the name of a description
            file. A file name of "-" denotes the
            standard input. If there are no "-f"
            arguments, the file named makefile or
            Makefile in the current directory is read.
            The contents of the description files
            override the built-in rules if they are
            present.

Finally, the remaining arguments are assumed to be the names
of targets to be made and the arguments are done in
left-to-right order. If there are no such arguments, the first name in
the description files that does not begin with a period is
"made".



THE ENVIRONMENT VARIABLES 

Environment variables are read and added to the macro 
definitions each time make executes. Precedence is a prime 
consideration in doing this properly. The following describes 
make's interaction with the environment. A new macro, 
MAKEFLAGS, is maintained by make. The new macro is 
defined as the collection of all input flag arguments into a 
string (without minus signs). The new macro is exported and 
thus accessible to further invocations of make. Command line 
flags and assignments in the makefile update MAKEFLAGS. 
Thus, to describe how the environment interacts with make, 
the MAKEFLAGS macro (environment variable) must be 
considered. 



When executed, make assigns macro definitions in the 
following order: 



1. Read the MAKEFLAGS environment variable. If it is 
not present or null, the internal make variable 
MAKEFLAGS is set to the null string. Otherwise, each 
letter in MAKEFLAGS is assumed to be an input flag 
argument and is processed as such. (The only exceptions
are the -f, -p, and -r flags.)

2. Read and set the input flags from the command line. 
The command line adds to the previous settings from the 
MAKEFLAGS environment variable. 

3. Read macro definitions from the command line. These
are made not resettable. Thus, any further assignments to
these names are ignored.

4. Read the internal list of macro definitions. These are
found in the file rules.c of the source for make. Figure
13-1 contains the complete makefile that represents the
internally defined macros and rules of the current
version of make. Thus, if make -r ... is typed and a
makefile includes the makefile in Figure 13-1, the results
would be identical to excluding the -r option and the
include line in the makefile. The Figure 13-1 output can
be reproduced by the following:

    make -fp - < /dev/null 2>/dev/null

The output appears on the standard output. The internal
definitions give defaults for the C language compiler
(CC=cc), the assembler (AS=as), etc.

5. Read the environment. The environment variables are
treated as macro definitions and marked as exported (in
the shell sense). However, since MAKEFLAGS is not
an internally defined variable (in rules.c), this has the
effect of doing the same assignment twice (MAKEFLAGS is
read and set again). The exception
to this is when MAKEFLAGS is assigned on the
command line. (The reason it was read previously was to
turn the debug flag on before anything else was done.)

6. Read the makefile(s). The assignments in the makefile(s)
override the environment. This order is chosen so that
when a makefile is read and executed, you know what to
expect. That is, you get what is seen unless the -e flag is
used. The -e is an additional command line flag which
tells make to have the environment override the
makefile assignments. Thus, if make -e ... is typed, the
variables in the environment override the definitions in
the makefile*. Also, MAKEFLAGS overrides the
environment if assigned. This is useful for further
invocations of make from the current makefile.



    # LIST OF SUFFIXES

    .SUFFIXES: .o .c .c~ .y .y~ .l .l~ .s .s~ .sh .sh~ .h .h~

    # PRESET VARIABLES

    MAKE=make
    YACC=yacc
    YFLAGS=
    LEX=lex
    LFLAGS=
    LD=ld
    LDFLAGS=
    CC=cc
    CFLAGS=-O
    AS=as
    ASFLAGS=
    GET=get
    GFLAGS=



Figure 13-1. Example of Internal Definitions (Sheet 1 
of 4) 



* There is no way to override the command line assignments.

    # SINGLE SUFFIX RULES

    .c:
            $(CC) $(CFLAGS) $(LDFLAGS) $< -o $@
    .c~:
            $(GET) $(GFLAGS) -p $< > $*.c
            $(CC) $(CFLAGS) $(LDFLAGS) $*.c -o $*
            -rm -f $*.c
    .sh:
            cp $< $@; chmod 0777 $@
    .sh~:
            $(GET) $(GFLAGS) -p $< > $*.sh
            cp $*.sh $*; chmod 0777 $@
            -rm -f $*.sh

    # DOUBLE SUFFIX RULES

    .c.o:
            $(CC) $(CFLAGS) -c $<
    .c~.o:



Figure 13-1. Example of Internal Definitions (Sheet 2
of 4)

            $(GET) $(GFLAGS) -p $< > $*.c
            $(CC) $(CFLAGS) -c $*.c
            -rm -f $*.c
    .c~.c:
            $(GET) $(GFLAGS) -p $< > $*.c
    .s.o:
            $(AS) $(ASFLAGS) -o $@ $<
    .s~.o:
            $(GET) $(GFLAGS) -p $< > $*.s
            $(AS) $(ASFLAGS) -o $*.o $*.s
            -rm -f $*.s
    .y.o:
            $(YACC) $(YFLAGS) $<
            $(CC) $(CFLAGS) -c y.tab.c
            rm y.tab.c
            mv y.tab.o $@
    .y~.o:
            $(GET) $(GFLAGS) -p $< > $*.y
            $(YACC) $(YFLAGS) $*.y
            $(CC) $(CFLAGS) -c y.tab.c
            rm -f y.tab.c $*.y
            mv y.tab.o $*.o
    .l.o:
            $(LEX) $(LFLAGS) $<
            $(CC) $(CFLAGS) -c lex.yy.c
            rm lex.yy.c
            mv lex.yy.o $@



Figure 13-1. Example of Internal Definitions (Sheet 3
of 4)

    .l~.o:
            $(GET) $(GFLAGS) -p $< > $*.l
            $(LEX) $(LFLAGS) $*.l
            $(CC) $(CFLAGS) -c lex.yy.c
            rm -f lex.yy.c $*.l
            mv lex.yy.o $*.o
    .y.c:
            $(YACC) $(YFLAGS) $<
            mv y.tab.c $@
    .y~.c:
            $(GET) $(GFLAGS) -p $< > $*.y
            $(YACC) $(YFLAGS) $*.y
            -rm -f $*.y
    .l.c:
            $(LEX) $<
            mv lex.yy.c $@
    .c.a:
            $(CC) -c $(CFLAGS) $<
            ar rv $@ $*.o
            rm -f $*.o
    .c~.a:
            $(GET) $(GFLAGS) -p $< > $*.c
            $(CC) -c $(CFLAGS) $*.c
            ar rv $@ $*.o
    .s~.a:
            $(GET) $(GFLAGS) -p $< > $*.s
            $(AS) $(ASFLAGS) -o $*.o $*.s
            ar rv $@ $*.o
            -rm -f $*.[so]
    .h~.h:
            $(GET) $(GFLAGS) -p $< > $*.h



Figure 13-1. Example of Internal Definitions (Sheet 4
of 4)

It may be clearer to list the precedence of assignments. Thus,
in order from least binding to most binding, the precedence of 
assignments is as follows: 



1. internal definitions (from rules.c) 

2. environment 

3. makefile(s) 

4. command line. 

The -e flag has the effect of changing the order to:

1. internal definitions (from rules.c) 

2. makefile(s) 

3. environment 

4. command line. 

This order is general enough to allow a programmer to define a 
makefile or set of makefiles whose parameters are dynamically 
definable. 
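
The two orderings can be expressed as a short shell sketch. The function resolve is hypothetical and handles one macro at a time. It assumes that an empty argument means "not defined by this source", so it cannot represent a macro that is explicitly defined as the null string.

```shell
# resolve EFLAG INTERNAL ENVIRONMENT MAKEFILE CMDLINE
# Print the value that wins for one macro. EFLAG is -e or "".
resolve() {
    eflag=$1
    internal=$2
    environment=$3
    makefile=$4
    cmdline=$5
    if [ "$eflag" = -e ]; then
        set -- "$internal" "$makefile" "$environment" "$cmdline"
    else
        set -- "$internal" "$environment" "$makefile" "$cmdline"
    fi
    value=
    # Walk from least binding to most binding; later sources win.
    for candidate do
        if [ -n "$candidate" ]; then
            value=$candidate
        fi
    done
    printf '%s\n' "$value"
}
```

For example, resolve "" cc envcc mkcc "" prints mkcc, because the makefile outranks the environment. With -e as the first argument, the same call prints envcc.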



RECURSIVE MAKEFILES 

Another feature was added to make concerning the 
environment and recursive invocations. If the sequence 
"$(MAKE)" appears anywhere in a shell command line, the line 
is executed even if the -n flag is set. Since the -n flag is
exported across invocations of make (through the
MAKEFLAGS variable), the only thing that actually gets
executed is the make command itself. This feature is useful
when a hierarchy of makefile(s) describes a set of software
subsystems. For testing purposes, make -n ... can be
executed and everything that would have been done will get 
printed out including output from lower level invocations of 
make. 



SUFFIXES AND TRANSFORMATION RULES 

The make program does not know what file name suffixes are 
interesting or how to transform a file with one suffix into a file 
with another suffix. This information is stored in an internal 
table that has the form of a description file. If the -r flag is
used, the internal table is not used. 



The list of suffixes is actually the dependency list for the name 
".SUFFIXES". The make program searches for a file with any 
of the suffixes on the list. If such a file exists and if there is a 
transformation rule for that combination, make transforms a 
file with one suffix into a file with another suffix. The 
transformation rule names are the concatenation of the two 
suffixes. The name of the rule to transform a .r file to a .o file
is thus .r.o. If the rule is present and no explicit command 
sequence has been given in the user's description files, the 
command sequence for the rule .r.o is used. If a command is 
generated by using one of these suffixing rules, the macro $* is 
given the value of the stem (everything but the suffix) of the 
name of the file to be made; and the macro $< is the name of 
the dependent that caused the action. 
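
The suffix search can be sketched in the shell as follows. The function find_source is a hypothetical name. The sketch scans the suffixes from left to right and assumes that a transformation rule exists for every suffix given, whereas make also checks that the rule is present.

```shell
# find_source TARGET SUFFIX...: print the first existing file that
# shares TARGET's stem and carries one of the listed suffixes.
find_source() {
    target=$1
    shift
    stem=${target%.*}          # the value make gives to $*
    for suffix do
        if [ -f "$stem$suffix" ]; then
            printf '%s\n' "$stem$suffix"   # the value make gives to $<
            return 0
        fi
    done
    return 1                   # no source file found
}
```

With files x.c and x.y both present, find_source x.o .c .y .l prints x.c because .c comes first in the list. If x.c is removed, the same call prints x.y.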

The order of the suffix list is significant since the list is 
scanned from left to right. The first name formed that has both 
a file and a rule associated with it is used. If new names are to 
be appended, the user can add an entry for ".SUFFIXES" in his 
own description file. The dependents are added to the usual list. 
A ".SUFFIXES" line without any dependents deletes the 
current list. It is necessary to clear the current list if the order 
of names is to be changed. The following is an excerpt from
the default rules file:

    .SUFFIXES : .o .c .e .r .f .y .yr .ye .l .s
    YACC = yacc
    YACCR = yacc -r
    YACCE = yacc -e
    YFLAGS =
    LEX = lex
    LFLAGS =
    CC = cc
    AS = as -
    CFLAGS =
    RC = ec
    RFLAGS =
    EC = ec
    EFLAGS =
    FFLAGS =
    .c.o :
            $(CC) $(CFLAGS) -c $<
    .e.o .r.o .f.o :
            $(EC) $(RFLAGS) $(EFLAGS) $(FFLAGS) -c $<
    .s.o :
            $(AS) -o $@ $<
    .y.o :
            $(YACC) $(YFLAGS) $<
            $(CC) $(CFLAGS) -c y.tab.c
            rm y.tab.c
            mv y.tab.o $@
    .y.c :
            $(YACC) $(YFLAGS) $<
            mv y.tab.c $@



IMPLICIT RULES

The make program uses a table of interesting suffixes and a
set of transformation rules to supply default dependency
information and implied commands. The default suffix list is
as follows:

    .o      Object file
    .o~     SCCS Object file
    .c      C source file
    .c~     SCCS C source file
    .s      Assembler source file
    .s~     SCCS Assembler source file
    .y      Yacc-C source grammar
    .y~     SCCS Yacc-C source grammar
    .h      Header file
    .h~     SCCS Header file
    .sh     Shell file
    .sh~    SCCS Shell file
    .l      Lex source grammar
    .l~     SCCS Lex source grammar

Figure 13-2 summarizes the default transformation paths. If 
there are two paths connecting a pair of suffixes, the longer one 
is used only if the intermediate file exists or is named in the 
description.

Figure 13-2. Summary of Default Transformation Path

If the file x.o were needed and there were an x.c in the 
description or directory, the x.o file would be compiled. If there 
were also an x.l, that grammar would be run through Lex 
before compiling the result. However, if there were no x.c but 
there were an x.l, make would discard the intermediate C 
language file and use the direct link as shown in Figure 13-3. 

It is possible to change the names of some of the compilers used 
in the default or the flag arguments with which they are 
invoked by knowing the macro names used. The compiler 
names are the macros AS, CC, YACC and LEX. The 
command 

    make CC=newcc

will cause the newcc command to be used instead of the usual 
C language compiler. The macros CFLAGS, RFLAGS, 
EFLAGS, YFLAGS, and LFLAGS may be set to cause these 
commands to be issued with optional flags. Thus

    make "CFLAGS=-O"

causes the optimizing C language compiler to be used.



FORMAT OF SHELL COMMANDS 
WITHIN make 

The make program remembers embedded newlines and tabs in 
shell command sequences. Thus, if the programmer puts a for 
loop in the makefile with indentation, when make prints it out, 
it retains the indentation and backslashes. The output can still 
be piped to the shell and is readable. This is obviously a 
cosmetic change; no new function is gained. 



ARCHIVE LIBRARIES 

The make program has an improved interface to archive 
libraries. Due to a lack of documentation, most people are 
probably not aware of the current syntax of addressing 
members of archive libraries. The previous version of make 
allows a user to name a member of a library in the following 
manner: 



lib(object.o) 

or 
lib((_localtime)) 

where the second method actually refers to an entry point of an 
object file within the library. (Make looks through the 
library, locates the entry point, and translates it to the correct 
object file name.) 

To use this procedure to maintain an archive library, the
following type of makefile is required:

lib:: lib(ctime.o)
        $(CC) -c -O ctime.c
        ar rv lib ctime.o
        rm ctime.o
lib:: lib(fopen.o)
        $(CC) -c -O fopen.c
        ar rv lib fopen.o
        rm fopen.o
... and so on for each object ...

This is tedious and error prone. Obviously, the command 
sequences for adding a C language file to a library are the same 
for each invocation; the file name being the only difference each 
time. (This is true in most cases.) 

The current version gives the user access to a rule for building
libraries. The handle for the rule is the ".a" suffix. Thus, a
".c.a" rule is the rule for compiling a C language source file,
adding it to the library, and removing the ".o" cadaver.
Similarly, the ".y.a", the ".s.a", and the ".l.a" rules rebuild
YACC, assembler, and LEX files, respectively. The current
archive rules defined internally are ".c.a", ".c~.a", and ".s~.a".
[The tilde (~) syntax will be described shortly.] The user may
define other needed rules in the makefile.

The above 2-member library is then maintained with the 
following shorter makefile: 

lib: lib(ctime.o) lib(fopen.o)
        echo lib up-to-date.

The internal rules are already defined to complete the
preceding library maintenance. The actual ".c.a" rule is as
follows:

.c.a:
        $(CC) -c $(CFLAGS) $<
        ar rv $@ $*.o
        rm -f $*.o

Thus, the $@ macro is the ".a" target (lib); the $< and $*
macros are set to the out-of-date C language file and to the
file name sans the suffix, respectively (ctime.c and ctime). The $<
macro (in the preceding rule) could have been changed to $*.c.

It might be useful to go into some detail about exactly what 
make does when it sees the construction 

lib: lib(ctime.o) 

@echo lib up-to-date 

Assume the object in the library is out of date with respect to 
ctime.c. Also, there is no ctime.o file. 



1. Do lib.

2. To do lib, do each dependent of lib.

3. Do lib(ctime.o).

4. To do lib(ctime.o), do each dependent of lib(ctime.o).
(There are none.)

5. Use internal rules to try to build lib(ctime.o). (There is
no explicit rule.) Note that lib(ctime.o) has a parenthesis
in the name to identify the target suffix as ".a". This is
the key. There is no explicit ".a" at the end of the lib
library name. The parenthesis forces the ".a" suffix. In
this sense, the ".a" is hard wired into make.

6. Break the name lib(ctime.o) up into lib and ctime.o.
Define two macros, $@ (=lib) and $* (=ctime).

7. Look for a rule ".X.a" and a file $*.X. The first ".X" (in
the .SUFFIXES list) which fulfills these conditions is
".c" so the rule is ".c.a", and the file is ctime.c. Set $< to
be ctime.c and execute the rule. In fact, make must
then do ctime.c. However, the search of the current
directory yields no other candidates, and the search ends.

8. The library has been updated. Do the rule associated
with the "lib:" dependency; namely

echo lib up-to-date

It should be noted that to let ctime.o have dependencies, the 
following syntax is required: 

lib(ctime.o): $(INCDIR)/stdio.h 

Thus, explicit references to .o files are unnecessary. There is
also a new macro for referencing the archive member name
when this form is used. The $% macro is evaluated each time
$@ is evaluated. If there is no current archive member, $% is
null. If an archive member exists, then $% evaluates to the
expression between the parentheses.
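
As an illustration only, the way an archive reference is broken into
the values of $@, $% and $* can be imitated with ordinary shell
parameter expansion. The function name split_archive_target is
hypothetical and is not part of make; the sketch assumes a target of
the form lib(member.o) and does not handle the lib((entry)) form.

```sh
#!/bin/sh
# Hypothetical helper (not part of make): split "lib(ctime.o)" into
# the library name ($@), the member name ($%) and the stem ($*).
split_archive_target() {
    target=$1
    case $target in
    *'('*')')
        lib=${target%%"("*}      # text before the "("     -> $@
        member=${target#*"("}    # drop everything through "("
        member=${member%")"}     # drop the closing ")"    -> $%
        stem=${member%.*}        # member without suffix   -> $*
        printf '%s %s %s\n' "$lib" "$member" "$stem"
        ;;
    *)
        # Not an archive reference: there is no member, so $% is null.
        printf '%s\n' "$target"
        ;;
    esac
}

split_archive_target 'lib(ctime.o)'    # prints: lib ctime.o ctime
```

The three words printed correspond to lib, ctime.o and ctime in
steps 6 and 7 above.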

An example makefile for a larger library is given in Figure 13-3.

#       @(#)/usr/src/cmd/make/make.tm   3.2
LIB = lsxlib
PR = lp
INSDIR = /r1/flop0/
INS = eval

lsx::   $(LIB) low.o mch.o
        ld -x low.o mch.o $(LIB)
        mv a.out lsx
        @size lsx

# Here, $(INS) is either ":" or "eval".
lsx::
        $(INS) 'cp lsx $(INSDIR)lsx && \
                strip $(INSDIR)lsx && \
                ls -l $(INSDIR)lsx'

print:
        $(PR) header.s low.s mch.s *.h *.c Makefile

$(LIB): \
        $(LIB)(CLOCK.o) \
        $(LIB)(main.o) \
        $(LIB)(tty.o) \
        $(LIB)(trap.o) \
        $(LIB)(sysent.o) \
        $(LIB)(sys2.o) \
        $(LIB)(sys3.o) \
        $(LIB)(sys4.o) \
        $(LIB)(sys1.o) \
        $(LIB)(sig.o) \
        $(LIB)(fio.o) \
        $(LIB)(kl.o) \
        $(LIB)(alloc.o) \
        $(LIB)(nami.o) \
        $(LIB)(iget.o) \
        $(LIB)(rdwri.o) \
        $(LIB)(subr.o) \
        $(LIB)(bio.o) \
        $(LIB)(decfd.o) \
        $(LIB)(slp.o) \
        $(LIB)(space.o) \
        $(LIB)(puts.o)
        @echo $(LIB) now up to date.

.s.o:
        as -o $*.o header.s $*.s

.o.a:
        ar rv $@ $<
        rm -f $<

.s.a:
        as -o $*.o header.s $*.s
        ar rv $@ $*.o
        rm -f $*.o

.PRECIOUS: $(LIB)

Figure 13-3. Example of Library Makefile



The reader will note also that there are no lingering "*.o" files 
left around. The result is a library maintained directly from 
the source files (or more generally from the SCCS files). 



SOURCE CODE CONTROL SYSTEM FILE 
NAMES: THE TILDE 

The syntax of make does not directly permit referencing of 
prefixes. For most types of files on UNIX operating system 
machines, this is acceptable since nearly everyone uses a suffix 
to distinguish different types of files. The SCCS files are the 
exception. Here, "s." precedes the file name part of the 
complete pathname. 




To allow make easy access to the prefix "s." requires either a
redefinition of the rule naming syntax of make or a trick. The
trick is to use the tilde (~) as an identifier of SCCS files.
Hence, ".c~.o" refers to the rule which transforms an SCCS C
language source file into an object. Specifically, the internal
rule is

.c~.o:
        $(GET) $(GFLAGS) -p $< > $*.c
        $(CC) $(CFLAGS) -c $*.c
        -rm -f $*.c

Thus, the tilde appended to any suffix transforms the file
search into an SCCS file name search with the actual suffix
named by the dot and all characters up to (but not including)
the tilde.
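
The name mapping implied by a tilde suffix can be sketched in the
shell. The function sccs_source below is a hypothetical illustration,
not something make provides; it assumes the SCCS file lives in the
same directory as the target and simply shows which file name a rule
such as ".c~.o" would look for.

```sh
#!/bin/sh
# Hypothetical helper: given a target (e.g. x.o) and a tilde suffix
# (e.g. .c~), print the SCCS file name the suffix refers to (s.x.c).
sccs_source() {
    target=$1
    suffix=$2
    case $target in
    */*) dir=${target%/*}/; base=${target##*/} ;;
    *)   dir=;              base=$target ;;
    esac
    stem=${base%.*}        # file name without its own suffix
    real=${suffix%"~"}     # actual suffix: everything up to the tilde
    printf '%s\n' "${dir}s.${stem}${real}"
}

sccs_source x.o '.c~'          # prints: s.x.c
sccs_source src/gram.o '.y~'   # prints: src/s.gram.y
```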

The following SCCS suffixes are internally defined:

.c~
.y~
.s~
.sh~
.h~

The following rules involving SCCS transformations are
internally defined:

.c~:
.sh~:
.c~.o:
.s~.o:
.y~.o:
.l~.o:
.y~.c:
.c~.a:
.s~.a:
.h~.h:



Obviously, the user can define other rules and suffixes which 
may prove useful. The tilde gives him a handle on the SCCS 
file name format so that this is possible. 



THE NULL SUFFIX 

In the UNIX system source code, there are many commands 
which consist of a single source file. It was wasteful to 
maintain an object of such files for make. The current 
implementation supports single suffix rules (a null suffix). 
Thus, to maintain the program cat, a rule in the makefile of 
the following form is needed: 



.c:
        $(CC) -n -O $< -o $@

In fact, this ".c:" rule is internally defined so no makefile is 
necessary at all. The user only needs to type 

make cat dd echo date 

(these are notable single file programs) and all four C language
source files are passed through the above shell command line
associated with the ".c:" rule. The internally defined single
suffix rules are

.c:
.c~:
.sh:
.sh~:



Others may be added in the makefile by the user. 



INCLUDE FILES 

The make program has an include file capability. If the string
include appears as the first seven letters of a line in a makefile
and is followed by a blank or a tab, the string that follows is
assumed to be a file name which the current invocation of make
will read. The file descriptors are stacked for reading include
files so that no more than about 16 levels of nested includes are
supported.



INVISIBLE SCCS MAKEFILES

SCCS makefiles are invisible to make. That is, if make
is typed and only a file named s.makefile exists, make will do
a get on the file, then read and remove the file. Using the -f
option, make will get, read, and remove arguments and include files.



DYNAMIC DEPENDENCY PARAMETERS

A new dependency parameter has been defined. The parameter
has meaning only on the dependency line in a makefile. The
$$@ refers to the current "thing" to the left of the colon (which
is $@). Also the form $$(@F) exists which allows access to the
file part of $@. Thus, in the following:

cat: $$@.c

the dependency is translated at execution time to the string
"cat.c". This is useful for building a large number of executable
files, each of which has only one source file. For instance, the
UNIX software command directory could have a makefile like:

CMDS = cat dd echo date cc cmp comm ar ld chown

$(CMDS): $$@.c
        $(CC) -O $? -o $@

Obviously, this is a subset of all the single file programs. For 
multiple file programs, a directory is usually allocated and a 
separate makefile is made. For any particular file that has a 
peculiar compilation procedure, a specific entry must be made 
in the makefile. 

The second useful form of the dependency parameter is $$(@F).
It represents the file name part of $$@. Again, it is evaluated
at execution time. Its usefulness becomes evident when trying
to maintain the /usr/include directory from a makefile in the
/usr/src/head directory. Thus, the /usr/src/head/makefile
would look like

INCDIR = /usr/include

INCLUDES = \
        $(INCDIR)/stdio.h \
        $(INCDIR)/pwd.h \
        $(INCDIR)/dir.h \
        $(INCDIR)/a.out.h

$(INCLUDES): $$(@F)
        cp $? $@
        chmod 0444 $@

This would completely maintain the /usr/include directory
whenever one of the above files in /usr/src/head was updated.
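
To make the effect of $$(@F) concrete, the following shell sketch
prints the dependency line that results for each target once the
parameter has been replaced by the file part of the target name. The
function expand_at_f is a hypothetical name used only for this
illustration; make performs the substitution internally.

```sh
#!/bin/sh
# Hypothetical illustration: for every target, show the dependency
# line obtained when $$(@F) is replaced by the file part of $@.
expand_at_f() {
    for target in "$@"; do
        printf '%s: %s\n' "$target" "${target##*/}"
    done
}

expand_at_f /usr/include/stdio.h /usr/include/pwd.h
# prints:
# /usr/include/stdio.h: stdio.h
# /usr/include/pwd.h: pwd.h
```

Each installed header therefore depends on the file of the same name
in the directory where make is run.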



SUGGESTIONS AND WARNINGS 

The most common difficulties arise from make's specific
meaning of dependency. If file x.c has a line of the form
#include "defs", then the object file x.o depends on defs; the
source file x.c does not. If defs is changed, nothing is done to
the file x.c while file x.o must be recreated.



To discover what make would do, the -n option is very useful.
The command

make -n

orders make to print out the commands which make would
issue without actually taking the time to execute them. If a
change to a file is absolutely certain to be mild in character
(e.g., adding a new definition to an include file), the -t (touch)
option can save a lot of time. Instead of issuing a large number
of superfluous recompilations, make updates the modification
times on the affected files. Thus, the command

make -ts

("touch silently") causes the relevant files to appear up to date.
Obvious care is necessary since this mode of operation subverts
the intention of make and destroys all memory of the previous
relationships.

The debugging flag (-d) causes make to print out a very
detailed description of what it is doing including the file times.
The output is verbose and recommended only as a last resort.






Chapter 14

SOURCE CODE CONTROL SYSTEM USER GUIDE

GENERAL

SCCS FOR BEGINNERS

DELTA NUMBERING

SCCS COMMAND CONVENTIONS

SCCS COMMANDS

SCCS FILES

AN SCCS INTERFACE PROGRAM






GENERAL 

The Source Code Control System (SCCS) is a collection of
UNIX software commands that help individuals or projects
control and account for changes to files of text. The source code
and documentation of software systems are typical examples of
files of text to be changed. It is convenient to
conceive of SCCS as a custodian of files. SCCS provides
facilities for



• Storing files of text

• Retrieving particular versions of the files

• Controlling updating privileges to files

• Identifying the version of a retrieved file

• Recording who made each change to a file, and when, where,
and why the change was made.

These types of facilities are important when programs and
documentation undergo frequent changes because of
maintenance and/or enhancement work. It is often desirable to
regenerate the version of a program or document as it existed
before changes were applied to it. This can be done by keeping
copies (on paper or other media), but this method quickly
becomes unmanageable and wasteful as the number of
programs and documents increases. SCCS provides an
attractive solution because the original file is stored on disk
and, whenever changes are made to the file, SCCS adds only the
changes to the file. The tracking information is also maintained
as part of the same file. Each set of changes is called a "delta".

This chapter, together with relevant portions of the AT&T
UNIX PC UNIX System V Manual, is a complete user's guide to
SCCS. The following topics are covered:

• SCCS for Beginners: How to make an SCCS file, how to
update it, and how to retrieve a version thereof.

• How Deltas Are Numbered: How versions of SCCS files are
numbered and named.

• SCCS Command Conventions: Conventions and rules
generally applicable to all SCCS commands.

• SCCS Commands: Explanation of all SCCS commands with
discussions of the more useful arguments.

• SCCS Files: Protection, format, and auditing of SCCS files
including a discussion of the differences between using
SCCS as an individual and using it as a member of a group
or project. The role of a "project SCCS administrator" is
introduced.

Neither the implementation of SCCS nor the installation
procedure for SCCS is described in this section.

Throughout this section, each reference of the form name(1M),
name(7), or name(8) refers to entries in the AT&T UNIX PC
UNIX System V Manual. All other references to entries of the
form name(N), where "N" is a number (1 through 5) possibly
followed by a letter, refer to entry name in section N of the
AT&T UNIX PC UNIX System V Manual.



SCCS FOR BEGINNERS

It is assumed that the reader knows how to log onto a UNIX 
system, create files, and use the text editor. A number of 
terminal-session fragments are presented. All of them should 
be tried since the best way to learn SCCS is to use it. 

To supplement the material in this section, the detailed SCCS 
command descriptions in the AT&T UNIX PC UNIX System V 
Manual should be consulted. 



A. Terminology 

Each SCCS file is composed of one or more sets of changes 
applied to the null (empty) version of the file, with each set of 
changes usually depending on all previous sets. Each set of 
changes is called a "delta" and is assigned a name, called the
SCCS IDentification string (SID). The SID is composed of at
most four components. The first two components are the
"release" and "level" numbers which are separated by a period. 
Hence, the first delta (for the original file) is called "1.1", the 
second "1.2", the third "1.3", etc. The release number can also 
be changed allowing, for example, deltas "2.1", "3.1", etc. The 
change in the release number usually indicates a major change 
to the file. 

Each delta of an SCCS file defines a particular version of the
file. For example, delta 1.5 defines version 1.5 of the SCCS file,
obtained by applying to the null (empty) version of the file the
changes that constitute deltas 1.1, 1.2, etc., up to and including
delta 1.5 itself, in that order.



B. Creating an SCCS File via "admin"

Consider, for example, a file called lang that contains a list of 
programming languages. 



c 

pl/i 

fortran 

cobol 

algol 

Custody of the lang file can be given to SCCS. The following 
admin command (used to "administer" SCCS files) creates an 
SCCS file and initializes delta 1.1 from the file lang: 

admin -ilang s.lang 

All SCCS files must have names that begin with "s.", hence,
s.lang. The -i keyletter, together with its value lang, indicates
that admin is to create a new SCCS file and "initialize" the
new SCCS file with the contents of the file lang. This initial
version is a set of changes (delta 1.1) applied to the null SCCS
file.

The admin command replies 

No id keywords (cm7) 

This is a warning message (which may also be issued by other
SCCS commands) that is to be ignored for the purposes of this
section. Its significance is described under the get command in
the section "SCCS COMMANDS." In the following examples,
this warning message is not shown although it may actually be
issued by the various commands. The file lang should now be
removed (because it can be easily reconstructed using the get
command) as follows:

rm lang



C. Retrieving a File via "get"

The lang file can be reconstructed by using the following get 
command: 



get s.lang 

The command causes the creation (retrieval) of the latest 
version of file s.lang and prints the following messages: 

1.1 
5 lines 



This means that get retrieved version 1.1 of the file, which is 
made up of five lines of text. The retrieved text is placed in a 
file whose name is formed by deleting the "s." prefix from the 
name of the SCCS file. Hence, the file lang is created. 

The "get s.lang" command simply creates the file lang (read- 
only) and keeps no information regarding its creation. On the 
other hand, in order to be able to subsequently apply changes to 
an SCCS file with the delta command, the get command must 
be informed of your intention to do so. This is done as follows: 

get -e s.lang 

The -e keyletter causes get to create a file lang for both
reading and writing (so it may be edited) and places certain
information about the SCCS file in another new file. The new
file, called the p-file, will be read by the delta command. The
get command prints the same messages as before except that
the SID of the version to be created through the use of delta is
also issued. For example,

get -e s.lang
1.1
new delta 1.2
5 lines

The file lang may now be changed, for example, by

ed lang
27
$a
snobol
ratfor
.
w
41
q



D. Recording Changes via "delta"

In order to record within the SCCS file the changes that have
been applied to lang, execute the following command:

delta s.lang

Delta prompts with

comments?

The response should be a description of why the changes were
made. For example,

comments? added more languages

The delta command then reads the p-file and determines what
changes were made to the file lang. The delta command does
this by doing its own get to retrieve the original version and by
applying the diff(1) command to the original version and the
edited version.

When this process is complete, at which point the changes to
lang have been stored in s.lang, delta outputs

1.2
2 inserted
0 deleted
5 unchanged

The number "1.2" is the name of the delta just created, and the
next three lines of output refer to the number of lines in the
file s.lang.



E. Additional Information About "get"

As shown in the previous example, the command

get s.lang

retrieves the latest version (now 1.2) of the file s.lang. This is
done by starting with the original version of the file and
successively applying deltas (the changes) in order until all
have been applied.

In the example chosen, the following commands are all
equivalent:

get s.lang
get -r1 s.lang
get -r1.2 s.lang

The numbers following the -r keyletter are SIDs. Note that
omitting the level number of the SID (as in "get -r1 s.lang") is
equivalent to specifying the highest level number that exists
within the specified release. Thus, the second command
requests the retrieval of the latest version in release 1, namely
1.2. The third command specifically requests the retrieval of a
particular version, in this case, also 1.2.
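
The rule for an omitted level number can be modeled with a short
shell function. This is only a sketch of the selection rule described
above, not how get is implemented: the name latest_in_release is
hypothetical, the list of existing SIDs is passed in by hand, and
branch deltas (four-component SIDs) are ignored.

```sh
#!/bin/sh
# Hypothetical model: given a release number and the trunk SIDs that
# exist, print the SID with the highest level in that release.
latest_in_release() {
    release=$1
    shift
    best=
    for sid in "$@"; do
        case $sid in
        "$release".*.*)
            ;;                        # branch delta: not considered
        "$release".*)
            level=${sid#*.}
            if [ -z "$best" ] || [ "$level" -gt "$best" ]; then
                best=$level
            fi
            ;;
        esac
    done
    if [ -n "$best" ]; then
        printf '%s.%s\n' "$release" "$best"
    fi
}

latest_in_release 1 1.1 1.2    # prints: 1.2
```

With deltas 1.1 and 1.2 present, asking for release 1 selects 1.2,
which matches the behavior of "get -r1 s.lang" in the example.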

Whenever a truly major change is made to a file, the 
significance of that change is usually indicated by changing the 
release number (first component of the SID) of the delta being 
made. Since normal automatic numbering of deltas proceeds by 
incrementing the level number (second component of the SID), 
the user must indicate to SCCS the need to change the release 
number. This is done with the get command. 

get -e -r2 s.lang 

Because release 2 does not exist, get retrieves the latest version
before release 2. The get command also interprets this as a
request to change the release number of the delta that the
user wishes to create to 2, thereby causing it to be named 2.1
rather than 1.3. This information is conveyed to delta via the
p-file. The get command then outputs

1.2
new delta 2.1
7 lines

which indicates that version 1.2 has been retrieved and that 2.1
is the version delta will create. If the file is now edited, for
example, by

ed lang
41
/cobol/d
w
35
q

and delta executed

delta s.lang
comments? deleted cobol from list of languages

the user will see by delta's output that version 2.1 is indeed
created.

2.1
0 inserted
1 deleted
6 unchanged

Deltas may now be created in release 2 (deltas 2.2, 2.3, etc.), or 
another new release may be created in a similar manner. This 
process may be continued as desired. 



F. The "help" Command

If the command

get abc

is executed, the following message will be output:

ERROR [abc]: not an SCCS file (co1)

The string "co1" is a code for the diagnostic message and may
be used to obtain a fuller explanation of that message by use of
the help command.

help co1

This produces the following output:

co1:
"not an SCCS file"
A file that you think is an SCCS file
does not begin with the characters "s.".

Thus, help is a useful command to use whenever there is any 
doubt about the meaning of an SCCS message. Detailed 
explanations of almost all SCCS messages may be found in this 
manner. 



DELTA NUMBERING 

It is convenient to conceive of the deltas applied to an SCCS file 
as the nodes of a tree in which the root is the initial version of 
the file. The root delta (node) is normally named "1.1" and 
successor deltas (nodes) are named "1.2", "1.3", etc. The 
components of the names of the deltas are called the "release" 
and the "level" numbers, respectively. Thus, normal naming of 
successor deltas proceeds by incrementing the level number, 
which is performed automatically by SCCS whenever a delta is 
made. In addition, the user may wish to change the release
number when making a delta to indicate that a major change is
being made. When this is done, the release number also applies
to all successor deltas unless specifically changed again. Thus,
the evolution of a particular file may be represented as in
Figure 14-1.

    o------o------o------o---|---o------o
   1.1    1.2    1.3    1.4  |  2.1    2.2
          RELEASE 1          |  RELEASE 2

Figure 14-1. Evolution of an SCCS File

Such a structure may be termed the "trunk" of the SCCS tree. 
Figure 14-1 represents the normal sequential development of an 
SCCS file in which changes that are part of any given delta are 
dependent upon all the preceding deltas. 

However, there are situations in which it is necessary to cause 
a branching in the tree in that changes applied as part of a 
given delta are not dependent upon all previous deltas. As an 
example, consider a program which is in production use at 
version 1.3 and for which development work on release 2 is 
already in progress. Thus, release 2 may already have some 
deltas precisely as shown in Figure 14-1. Assume that a 
production user reports a problem in version 1.3 and that the 
nature of the problem is such that it cannot wait to be repaired 
in release 2. The changes necessary to repair the trouble will 
be applied as a delta to version 1.3 (the version in production 
use). This creates a new version that will then be released to 
the user but will not affect the changes being applied for 
release 2 (i.e., deltas 1.4, 2.1, 2.2, etc.). 

The new delta is a node on a branch of the tree. Its name
consists of four components: the release number and the level
number (as with trunk deltas) plus the "branch" number and
the "sequence" number. The delta name appears as follows:

release.level.branch.sequence

The branch number is assigned to each branch that is a
descendant of a particular trunk delta with the first such
branch being 1, the next one 2, etc. The sequence number is
assigned, in order, to each delta on a particular branch. Thus,
1.3.1.2 identifies the second delta of the first branch that
derives from delta 1.3. This is shown in Figure 14-2.

                  1.3.1.1  1.3.1.2
                     o--------o          BRANCH 1
                    /
   o------o------o------o------o------o
  1.1    1.2    1.3    1.4    2.1    2.2

Figure 14-2. Tree Structure With Branch Deltas



The concept of branching may be extended to any delta in the 
tree. The naming of the resulting deltas proceeds in the 
manner just illustrated. 



Two observations are of importance with regard to naming 
deltas. First, the names of trunk deltas contain exactly two 
components, and the names of branch deltas contain exactly 
four components. Second, the first two components of the name 
of branch deltas are always those of the ancestral trunk delta, 
and the branch component is assigned in the order of creation 
of the branch independently of its location relative to the trunk 
delta. Thus, a branch delta may always be identified as such 
from its name. Although the ancestral trunk delta may be 
identified from the branch delta's name, it is not possible to 
determine the entire path leading from the trunk delta to the 
branch delta. For example, if delta 1.3 has one branch
emanating from it, all deltas on that branch will be named
1.3.1.n. If a delta on this branch then has another branch
emanating from it, all deltas on the new branch will be named
1.3.2.n (see Figure 14-3). The only information that may be
derived from the name of delta 1.3.2.2 is that it is the
chronologically second delta on the chronologically second
branch whose trunk ancestor is delta 1.3. In particular, it is not
possible to determine from the name of delta 1.3.2.2 all the
deltas between it and trunk ancestor 1.3.

[Diagram of the tree of Figure 14-2 with a second branch, ending
in delta 1.3.2.2, emanating from a delta on branch 1; graphic not
reproduced.]

Figure 14-3. Extending the Branching Concept

It is obvious that the concept of branch deltas allows the
generation of arbitrarily complex tree structures. Although
this capability has been provided for certain specialized uses, it
is strongly recommended that the SCCS tree be kept as simple
as possible because comprehension of its structure becomes
extremely difficult as the tree becomes more complex.
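
The two observations about delta names (two components for a trunk
delta, four for a branch delta) can be checked mechanically. The
following shell sketch is illustrative only; describe_sid is a
hypothetical name and is not an SCCS command. It reports what can be
read from a name and nothing more, since the path between a branch
delta and its trunk ancestor cannot be recovered from the name.

```sh
#!/bin/sh
# Hypothetical helper: report what a delta name (SID) reveals.
# The body runs in a subshell so the IFS change stays local.
describe_sid() (
    sid=$1
    set -f          # no file name expansion while splitting
    IFS=.
    set -- $sid     # split the SID on periods
    case $# in
    2) echo "trunk delta: release $1, level $2" ;;
    4) echo "branch delta: trunk ancestor $1.$2, branch $3, sequence $4" ;;
    *) echo "not a complete delta name: $sid" ;;
    esac
)

describe_sid 1.3
# prints: trunk delta: release 1, level 3
describe_sid 1.3.2.2
# prints: branch delta: trunk ancestor 1.3, branch 2, sequence 2
```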



SCCS COMMAND CONVENTIONS 

This part discusses the conventions and rules that apply to 
SCCS commands. These rules and conventions are generally 
applicable to all SCCS commands with exceptions indicated. 
The SCCS commands accept two types of arguments: 



• Keyletter arguments 

• File arguments. 

Keyletter arguments (hereafter called simply "keyletters") 
begin with a minus sign (-), followed by a lowercase alphabetic 
character, and in some cases, followed by a value. These 
keyletters control the execution of the command to which they 
are supplied. 

File arguments (names of files and/or directories) specify the
file(s) that the given SCCS command is to process. Naming a
directory is equivalent to naming all the SCCS files within the
directory. Non-SCCS files and unreadable files [because of
permission modes via chmod(1)] in the named directories are
silently ignored.

In general, file arguments may not begin with a minus sign.
However, if the name "-" (a lone minus sign) is specified as an
argument to a command, the command reads the standard
input for lines and takes each line as the name of an SCCS file
to be processed. The standard input is read until end-of-file.
This feature is often used in pipelines with, for example, the
find(1) or ls(1) commands. Again, names of non-SCCS files
and of unreadable files are silently ignored.

All keyletters specified for a given command apply to all file 
arguments of that command. All keyletters are processed 
before any file arguments with the result that the placement of 
keyletters is arbitrary (i.e., keyletters may be interspersed with 
file arguments). File arguments, however, are processed left to 
right. Somewhat different argument conventions apply to the 
help, what, sccsdiff, and val commands. 

Certain actions of various SCCS commands are controlled by
flags appearing in SCCS files. Some of these flags are
discussed in this part. For a complete description of all such
flags, see the admin(1) entry in the AT&T UNIX PC UNIX
System V Manual.

The distinction between the real user [see passwd(1)] and the
effective user of a UNIX system is of concern in discussing
various actions of SCCS commands. For the present, it is
assumed that both the real user and the effective user are one
and the same (i.e., the user who is logged into a UNIX system).
This subject is discussed further in "SCCS FILES."

The balance of this section does not discuss command
conventions; it covers temporary files generated by SCCS.

All SCCS commands that modify an SCCS file do so by writing
a temporary copy, called the x-file. This file ensures that the
SCCS file is not damaged if processing should terminate
abnormally. The name of the x-file is formed by replacing the
"s." of the SCCS file name with "x.". When processing is
complete, the old SCCS file is removed and the x-file is
renamed to be the SCCS file. The x-file is created in the
directory containing the SCCS file, given the same mode [see
chmod(1)] as the SCCS file, and owned by the effective user.

To prevent simultaneous updates to an SCCS file, commands 
that modify SCCS files create a lock-file, called the z-file, whose 
name is formed by replacing the "s." of the SCCS file name 
with "z.". The z-file contains the process number of the 
command that creates it, and its existence is an indication to 
other commands that the SCCS file is being updated. Thus, 
other commands that modify SCCS files do not process an 
SCCS file if the corresponding z-file exists. The z-file is created 
with mode 444 (read-only) in the directory containing the SCCS 
file and is owned by the effective user. This file exists only for 
the duration of the execution of the command that creates it. 
In general, users can ignore x-files and z-files. The files may be 
useful in the event of system crashes or similar situations. 
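
After a crash it can be handy to know which leftover names to look
for. The shell sketch below derives the x-file and z-file names from
an SCCS file name by the prefix replacement just described. The
function sccs_aux_name is hypothetical and is not an SCCS command; it
only manipulates names and touches no files.

```sh
#!/bin/sh
# Hypothetical helper: replace the "s." prefix of an SCCS file name
# with another one-letter prefix ("x" for the x-file, "z" for the
# z-file), keeping the directory part unchanged.
sccs_aux_name() {
    prefix=$1
    path=$2
    case $path in
    */*) dir=${path%/*}/; base=${path##*/} ;;
    *)   dir=;            base=$path ;;
    esac
    case $base in
    s.*)
        printf '%s\n' "${dir}${prefix}.${base#s.}"
        ;;
    *)
        # Same wording as the SCCS diagnostic quoted earlier.
        echo "ERROR [$path]: not an SCCS file (co1)" >&2
        return 1
        ;;
    esac
}

sccs_aux_name x s.lang         # prints: x.lang
sccs_aux_name z proj/s.lang    # prints: proj/z.lang
```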

The SCCS commands produce diagnostics (on the diagnostic
output) of the form:

ERROR [name-of-file-being-processed]: message text (code)

The code in parentheses may be used as an argument to the 
help command to obtain a further explanation of the 
diagnostic message. Detection of a fatal error during the 
processing of a file causes the SCCS command to terminate 
processing of that file and to proceed with the next file, in 
order, if more than one file has been named. 
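
A shell procedure or other tool that wraps SCCS commands may want to pick the code out of a diagnostic so that it can be passed to help. The following Python sketch assumes the diagnostic has exactly the form given above; the name parse_diagnostic and the regular expression are illustrative and are not part of SCCS.

```python
import re

# Matches: ERROR [file]: message text (code)
# The bracketed file name is treated as optional in this sketch.
DIAGNOSTIC = re.compile(
    r"^ERROR ?(?:\[(?P<file>[^\]]*)\])?: (?P<text>.*) \((?P<code>\w+)\)$"
)

def parse_diagnostic(line):
    """Return a dict with keys file, text, and code, or None if the line is not a diagnostic."""
    match = DIAGNOSTIC.match(line.strip())
    return match.groupdict() if match else None

fields = parse_diagnostic("ERROR [s.abc]: nonexistent SID (ge5)")
# fields["code"] is the argument one would give to the help command.
```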



SCCS COMMANDS 

This part describes the major features of all the SCCS 
commands. Detailed descriptions of the commands and of all 
their arguments are given in the AT&T UNIX PC UNIX
System V Manual and should be consulted for further
information. The discussion below covers only the more
common arguments of the various SCCS commands.

The commands follow in approximate order of importance. The 
following is a summary of all the SCCS commands and of their 
major functions: 



get         Retrieves versions of SCCS files.

delta       Applies changes (deltas) to the text of
            SCCS files, i.e., creates new versions.

admin       Creates SCCS files and applies changes to
            parameters of SCCS files.

prs         Prints portions of an SCCS file in
            user-specified format.

help        Gives explanations of diagnostic messages.

rmdel       Removes a delta from an SCCS file; allows
            the removal of deltas that were created by
            mistake.

cdc         Changes the commentary associated with
            a delta.

what        Searches any UNIX system file(s) for all
            occurrences of a special pattern and prints
            out what follows it; is useful in finding
            identifying information expanded by the
            get command.

sccsdiff    Shows the differences between any two
            versions of an SCCS file.

comb        Combines two or more consecutive deltas
            of an SCCS file into a single delta; often
            reduces the size of the SCCS file.

val         Validates an SCCS file.



A. The "get" Command

The get command creates a text file that contains a particular 
version of an SCCS file. The particular version is retrieved by 
beginning with the initial version and then applying deltas, in 
order, until the desired version is obtained. The created file is 
called the g-file. The g-file name is formed by removing the 
"s." from the SCCS file name. The g-file is created in the 
current directory and is owned by the real user. The mode 
assigned to the g-file depends on how the get command is 
invoked. 



The most common invocation of get is 

get s.abc 

which normally retrieves the latest version on the trunk of the 
SCCS file tree and produces (for example) on the standard 
output 

1.3 

67 lines 

No id keywords (cm7) 

which indicates that 

1. Version 1.3 of file "s.abc" was retrieved (1.3 is the latest 
trunk delta). 

2. This version has 67 lines of text. 

3. No ID keywords were substituted in the file.

The generated g-file (file "abc") is given mode 444 (read-only).
This particular way of invoking get is intended to produce g- 
files only for inspection, compilation, etc. It is not intended for 
editing (i.e., not for making deltas). 

In the case of several file arguments (or directory-name 
arguments), similar information is given for each file processed, 
but the SCCS file name precedes it. For example,

get s.abc s.def 

produces 

s.abc: 

1.3 

67 lines 

No id keywords (cm7) 

s.def: 

1.7 

85 lines 

No id keywords (cm7) 



ID Keywords 

In generating a g-file to be used for compilation, it is useful and 
informative to record the date and time of creation, the version 
retrieved, the module's name, etc. within the g-file. This 
information appears in a load module when one is eventually 
created. SCCS provides a convenient mechanism for doing this
automatically. Identification (ID) keywords appearing 
anywhere in the generated file are replaced by appropriate 
values according to the definitions of these ID keywords. The 
format of an ID keyword is an uppercase letter enclosed by 
percent signs (%). For example,

%I%



is defined as the ID keyword that is replaced by the SID of the 
retrieved version of a file. Similarly, %H% is defined as the 
ID keyword for the current date (in the form "mm/dd/yy"), 
and %M% is defined as the name of the g-file. Thus, executing 
get on an SCCS file that contains the PL/I declaration, 

DCL ID CHAR(100) VAR INIT('%M% %I% %H%');

gives (for example) the following:

DCL ID CHAR(100) VAR INIT('MODNAME 2.3 07/07/77');

When no ID keywords are substituted by get, the following 
message is issued: 

No id keywords (cm7) 

This message is normally treated as a warning by get, 
although the presence of the i flag in the SCCS file causes it to 
be treated as an error. For a complete list of the 
approximately 20 ID keywords provided, see get(1) in the
AT&T UNIX PC UNIX System V Manual. 
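
The substitution of ID keywords can be pictured with the following Python sketch. It is a model only: the function expand_keywords and the table of values are hypothetical, and only the three keywords discussed above (%M%, %I%, and %H%) are supplied. A substitution count of zero corresponds to the case in which get issues the "No id keywords (cm7)" message.

```python
import re

def expand_keywords(text, values):
    """Replace each %X% keyword that has a value; return (new_text, substitutions)."""
    count = 0

    def substitute(match):
        nonlocal count
        letter = match.group(1)
        if letter in values:
            count += 1
            return values[letter]
        return match.group(0)  # unknown keywords are left untouched

    return re.sub(r"%([A-Z])%", substitute, text), count

# Illustrative values: module name, SID, and current date.
values = {"M": "MODNAME", "I": "2.3", "H": "07/07/77"}
line, n = expand_keywords("DCL ID CHAR(100) VAR INIT('%M% %I% %H%');", values)
```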

Retrieval of Different Versions 

Various keyletters are provided to allow the retrieval of other 
than the default version of an SCCS file. Normally, the default 
version is the most recent delta of the highest-numbered 
release on the trunk of the SCCS file tree. However, if the 
SCCS file being processed has a d (default SID) flag, the SID 
specified as the value of this flag is used as a default. The 
default SID is interpreted in exactly the same way as the value 
supplied with the -r keyletter of get.

The -r keyletter is used to specify an SID to be retrieved, in
which case the d (default SID) flag (if any) is ignored. For
example,

get -r1.3 s.abc

retrieves version 1.3 of file s.abc and produces (for example) on 
the standard output 

1.3 

64 lines 

A branch delta may be retrieved similarly:

get -r1.5.2.3 s.abc

which produces (for example) on the standard output 

1.5.2.3 
234 lines 

When a 2- or 4-component SID is specified as a value for the
-r keyletter (as above) and the particular version does not
exist in the SCCS file, the following error message results:

ERROR [s.filename]: nonexistent SID (ge5)

Omission of the level number, as in 

get -r3 s.abc 

causes retrieval of the trunk delta with the highest level 
number within the given release if the given release exists. 
Thus, the above command might output

3.7

213 lines 

If the given release does not exist, get retrieves the trunk delta 
with the highest level number within the highest-numbered 
existing release that is lower than the given release. For 
example, assuming release 9 does not exist in file s.abc and that 
release 7 is actually the highest-numbered release below 9, 
execution of 

get -r9 s.abc 

might produce 

7.6 

420 lines 

which indicates that trunk delta 7.6 is the latest version of file 
s.abc below release 9. Similarly, omission of the sequence 
number, as in 

get -r4.3.2 s.abc 

results in the retrieval of the branch delta with the highest 
sequence number on the given branch if it exists. (If the given 
branch does not exist, an error message results.) This might 
result in the following output: 

4.3.2.8 
89 lines 
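
The rule for a -r value that names only a release can be sketched in Python as follows. This is a simplified model, not the SCCS implementation: the trunk is represented as a hypothetical list of (release, level) pairs, and the function name resolve_release and the error raised when no suitable release exists are assumptions.

```python
def resolve_release(trunk, release):
    """Return the (release, level) that a release-only -r value would retrieve.

    trunk is an iterable of (release, level) pairs existing on the trunk.
    """
    trunk = list(trunk)
    # Use the given release if it exists; otherwise the highest lower release.
    candidates = [r for r, _ in trunk if r <= release]
    if not candidates:
        raise LookupError("no release at or below %d" % release)
    chosen = max(candidates)
    # Within the chosen release, take the highest level number.
    level = max(l for r, l in trunk if r == chosen)
    return (chosen, level)

trunk = [(1, 1), (1, 2), (3, 1), (3, 7), (7, 1), (7, 6)]
```

With this trunk, a request for release 3 yields (3, 7), and a request for the nonexistent release 9 yields (7, 6), matching the examples above.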

The -t keyletter is used to retrieve the latest (top) version in a
particular release (i.e., when no -r keyletter is supplied or
when its value is simply a release number). The latest version
is defined as that delta which was produced most recently,
independent of its location on the SCCS file tree. Thus, if the
most recent delta in release 3 is 3.5,

get -r3 -t s.abc

might produce

3.5 

59 lines 

However, if branch delta 3.2.1.5 were the latest delta (created 
after delta 3.5), the same command might produce 

3.2.1.5 
46 lines 
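
The difference between "highest on the trunk" and "most recently produced" can be modeled with a few lines of Python. The representation is an assumption made for illustration: each delta is a hypothetical (SID, serial) pair in which a larger serial number means the delta was created later.

```python
def latest_in_release(deltas, release):
    """Return the SID of the most recently created delta in the given release.

    deltas is a list of (sid, serial) pairs; trunk and branch deltas are
    treated alike, as the -t keyletter does.
    """
    candidates = [(serial, sid) for sid, serial in deltas
                  if int(sid.split(".")[0]) == release]
    return max(candidates)[1] if candidates else None

trunk_only = [("3.4", 10), ("3.5", 11)]
with_branch = trunk_only + [("3.2.1.5", 12)]   # branch delta created after 3.5
```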



Retrieval With Intent to Make a Delta 

Specification of the -e keyletter to the get command is an
indication of the intent to make a delta, and as such, its use is
restricted. The presence of this keyletter causes get to check



1. The user list (a list of login names and/or group IDs of 
users allowed to make deltas) to determine if the login 
name or group ID of the user executing get is on that 
list. Note that a null (empty) user list behaves as if it 
contained all possible login names. 

2. The release (R) of the version being retrieved satisfies
the relation

floor <= R <= ceiling

to determine if the release being accessed is a protected
release. The "floor" and "ceiling" are specified as flags in
the SCCS file.

3. The R is not locked against editing. The "lock" is
specified as a flag in the SCCS file.

4. Whether or not multiple concurrent edits are allowed for 
the SCCS file as specified by the j flag in the SCCS file. 

A failure of any of the first three conditions causes the 
processing of the corresponding SCCS file to terminate. 
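
The first three checks can be summarized by the following Python sketch. All names, the representation of the user list as a set of strings, and the wording of the reasons returned are hypothetical; the real diagnostics are those produced by get.

```python
def edit_problems(login, group_id, release, user_list, floor, ceiling, locked):
    """Return the reasons an edit would be refused; an empty list means the checks pass."""
    problems = []
    # 1. An empty user list behaves as if it contained every login name.
    if user_list and login not in user_list and str(group_id) not in user_list:
        problems.append("user not on user list")
    # 2. The release must lie between the floor and the ceiling.
    if not (floor <= release <= ceiling):
        problems.append("release outside floor/ceiling")
    # 3. The release must not be locked against editing.
    if release in locked:
        problems.append("release locked")
    return problems

ok = edit_problems("xyz", 100, 2, {"xyz", "1234"}, 1, 5, set())
```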

If the above checks succeed, the -e keyletter causes the
creation of a g-file in the current directory with mode 644 
(readable by everyone, writable only by the owner) owned by 
the real user. If a writable g-file already exists, get terminates 
with an error. This is to prevent inadvertent destruction of a 
g-file that already exists and is being edited for the purpose of 
making a delta. 

Any ID keywords appearing in the g-file are not substituted by
get (when the -e keyletter is specified) because the generated
g-file is subsequently used to create another delta.
Replacement of ID keywords would cause them to be permanently
changed within the SCCS file. In view of this, get does not
need to check for the presence of ID keywords within the g-file,
so the message

No id keywords (cm7)

is never output when get is invoked with the -e keyletter.

In addition, the -e keyletter causes the creation (or updating)
of a p-file, which is used to pass information to the delta
command.

The following is an example of the use of the -e keyletter:

get -e s.abc

which produces (for example) on the standard output 

1.3 

new delta 1.4 

67 lines 

If the -r and/or -t keyletters are used together with the -e
keyletter, the version retrieved for editing is as specified by the
-r and/or -t keyletters. However, it is redundant to use both
the -r and -t keyletters.

The keyletters -i and -x may be used to specify a list [see
get(1) in the AT&T UNIX PC UNIX System V Manual for the
syntax of such a list] of deltas to be included and excluded, 
respectively, by get. Including a delta means forcing the 
changes that constitute the particular delta to be included in 
the retrieved version. This is useful if one wants to apply the 
same changes to more than one version of the SCCS file. 
Excluding a delta means forcing it not to be applied. This may 
be used to undo (in the version of the SCCS file to be created) 
the effects of a previous delta. Whenever deltas are included or 
excluded, get checks for possible interference between such 
deltas and those deltas that are normally used in retrieving the 
particular version of the SCCS file. Two deltas can interfere, 
for example, when each one changes the same line of the 
retrieved g-file. Any interference is indicated by a warning 
that shows the range of lines within the retrieved g-file in 
which the problem may exist. The user is expected to examine 
the g-file to determine whether a problem actually exists and to 
take whatever corrective measures (if any) are deemed 
necessary (e.g., edit the file). 

Warning: The -i and -x keyletters should be used with
extreme care.

The -k keyletter is provided to facilitate regeneration of a
g-file that may have been accidentally removed or ruined
subsequent to the execution of get with the -e keyletter or to
simply generate a g-file in which the replacement of ID
keywords has been suppressed. Thus, a g-file generated by the
-k keyletter is identical to one produced by get executed
with the -e keyletter. However, no processing related to the
p-file takes place.



Concurrent Edits of Different SID 

The ability to retrieve different versions of an SCCS file allows 
a number of deltas to be "in progress" at any given time. This 
means that a number of get commands with the -e keyletter
may be executed on the same file provided that no two 
executions retrieve the same version (unless multiple 
concurrent edits are allowed). 



The p-file (created by the get command invoked with the -e
keyletter) is named by replacing the "s." in the SCCS file name 
with "p.". It is created in the directory containing the SCCS 
file, given mode 644 (readable by everyone, writable only by the 
owner), and owned by the effective user. The p-file contains the 
following information for each delta that is still "in progress": 

• The SID of the retrieved version. 

• The SID that is given to the new delta when it is created. 

• The login name of the real user executing get. 

The first execution of get -e causes the creation of the p-file
for the corresponding SCCS file. Subsequent executions only 
update the p-file with a line containing the above information. 
Before updating, however, get checks to assure that no entry 
(already in the p-file) specifies that the SID (of the version to 
be retrieved) is already retrieved (unless multiple concurrent 
edits are allowed).
If both checks succeed, the user is informed that other deltas
are in progress and processing continues. If either check fails,
an error message results. It is important to note that the 
various executions of get should be carried out from different 
directories. Otherwise, only the first execution succeeds since 
subsequent executions would attempt to overwrite a writable g- 
file, which is an SCCS error condition. In practice, such 
multiple executions are performed by different users so that 
this problem does not arise since each user normally has a 
different working directory. See "Protection" under the part 
"SCCS FILES" for a discussion of how different users are 
permitted to use SCCS commands on the same files. 
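
The bookkeeping performed on the p-file can be modeled as follows. This Python sketch is illustrative only: the PEntry record holds just the three items listed above, and the function add_p_entry, its joint_edit argument (standing for the j flag), and the error raised are hypothetical.

```python
from collections import namedtuple

# One line of the p-file: retrieved SID, SID of the delta to be created, login name.
PEntry = namedtuple("PEntry", "retrieved new_sid login")

def add_p_entry(entries, retrieved, new_sid, login, joint_edit=False):
    """Return the updated entry list, refusing a second edit of the same SID."""
    if not joint_edit and any(e.retrieved == retrieved for e in entries):
        raise RuntimeError("SID %s is already retrieved for editing" % retrieved)
    return entries + [PEntry(retrieved, new_sid, login)]

entries = add_p_entry([], "1.3", "1.4", "xyz")            # first get -e creates the p-file
entries = add_p_entry(entries, "1.2", "1.2.1.1", "wql")   # a different SID is accepted
```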

Figure 14-4 shows, for the most useful cases, the version of an 
SCCS file retrieved by get, as well as the SID of the version to 
be eventually created by delta, as a function of the SID 
specified to get.

Figure 14-4. Determination of New SID

Footnotes:

* "R", "L", "B", and "S" are the "release", "level", "branch",
and "sequence" components of the SID, respectively; "m"
means "maximum". Thus, for example, "R.mL" means "the
maximum level number within release R"; "R.L.(mB+1).1"
means "the first sequence number on the new branch (i.e.,
maximum branch number plus 1) of level L within release R".
Also note that if the SID specified is of the form "R.L",
"R.L.B", or "R.L.B.S", each of the specified components must
exist.

† The -b keyletter is effective only if the b flag [see
admin(1)] is present in the file. In this state, an entry of "-"
means "irrelevant".

‡ This case applies if the d (default SID) flag is not present in
the file. If the d flag is present in the file, the SID obtained
from the d flag is interpreted as if it had been specified on
the command line. Thus, one of the other cases in this figure
applies.

§ This case is used to force the creation of the first delta in the
new release.

** "hR" is the highest existing release that is lower than the
specified, nonexistent, release R.



Concurrent Edits of Same SID 

Under normal conditions, gets for editing (-e keyletter is
specified) based on the same SID are not permitted to occur
concurrently. That is, delta must be executed before a
subsequent get for editing is executed at the same SID as the
previous get. However, multiple concurrent edits (defined to
be two or more successive executions of get for editing based
on the same retrieved SID) are allowed if the j flag is set in the
SCCS file. Thus:

get -e s.abc 

1.1 

new delta 1.2 

5 lines 

may be immediately followed by 

get -e s.abc 

1.1 

new delta 1.1.1.1 

5 lines 

without an intervening execution of delta. In this case, a 
delta command corresponding to the first get produces delta 
1.2 [assuming 1.1 is the latest (most recent) trunk delta], and 
the delta command corresponding to the second get produces 
delta 1.1.1.1. If concurrent editing is taking place, the
user must specify the release and level information (the SID)
to the delta command.
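
The choice between "new delta 1.2" and "new delta 1.1.1.1" in the example above can be modeled with the Python sketch below. It is a deliberately narrow model that handles only the retrieval of a trunk SID of the form R.L; the complete rules are those of Figure 14-4. The function next_sid and its arguments are hypothetical.

```python
def next_sid(retrieved, existing, pending):
    """Return the SID promised to a new edit of the trunk delta `retrieved`.

    existing: SIDs already in the SCCS file.
    pending:  SIDs already promised to edits that are still in progress.
    """
    release, level = (int(part) for part in retrieved.split("."))
    successor = "%d.%d" % (release, level + 1)
    taken = set(existing) | set(pending)
    if successor not in taken:
        return successor                      # next level on the trunk
    # Otherwise start a new branch: maximum branch number plus 1, sequence 1.
    branches = [int(sid.split(".")[2]) for sid in taken
                if sid.count(".") == 3 and sid.startswith(retrieved + ".")]
    return "%s.%d.1" % (retrieved, max(branches, default=0) + 1)

first = next_sid("1.1", {"1.1"}, set())       # first get -e
second = next_sid("1.1", {"1.1"}, {first})    # second get -e with the j flag set
```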



Keyletters That Affect Output 

Specification of the -p keyletter causes get to write the
retrieved text to the standard output rather than to a g-file. In 
addition, all output normally directed to the standard output 
(such as the SID of the version retrieved and the number of 
lines retrieved) is directed instead to the diagnostic output. 
This may be used, for example, to create g-files with arbitrary 
names. 



get -p s.abc > arbitrary-file-name

The -s keyletter suppresses all output that is normally directed
to the standard output. Thus, the SID of the retrieved version,
the number of lines retrieved, etc., are not output. This does
not, however, affect messages to the diagnostic output. This
keyletter is used to prevent nondiagnostic messages from
appearing on the user's terminal and is often used in
conjunction with the -p keyletter to "pipe" the output of get,
as in 

get -p -s s.abc | nroff 

The -g keyletter is supplied to suppress the actual retrieval of
the text of a version of the SCCS file. This may be useful in a 
number of ways. For example, to verify the existence of a 
particular SID in an SCCS file, one may execute 

get -g -r4.3 s.abc 

This outputs the given SID if it exists in the SCCS file or it 
generates an error message if it does not. Another use of the 
-g keyletter is in regenerating a p-file that may have been
accidentally destroyed:

get -e -g s.abc 

The -l keyletter causes the creation of an l-file, which is named
by replacing the "s." of the SCCS file name with "l.". This file
is created in the current directory with mode 444 (read-only)
and is owned by the real user. It contains a table [whose
format is described in get(1) in the AT&T UNIX PC UNIX
System V Manual] showing the deltas used in constructing a
particular version of the SCCS file. For example,

get -r2.3 -l s.abc

generates an l-file showing the deltas applied to retrieve version
2.3 of the SCCS file. Specifying a value of "p" with the -l
keyletter, as in

get -lp -r2.3 s.abc

causes the generated output to be written to the standard
output rather than to the l-file. The -g keyletter may be used
with the -l keyletter to suppress the actual retrieval of the
text.

The -m keyletter is of use in identifying, line by line, the
changes applied to an SCCS file. Specification of this keyletter
causes each line of the generated g-file to be preceded by the 
SID of the delta that caused that line to be inserted. The SID 
is separated from the text of the line by a tab character. 

The -n keyletter causes each line of the generated g-file to be
preceded by the value of the %M% ID keyword and a tab
character. The -n keyletter is most often used in a pipeline
with grep(1). For example, to find all lines that match a given
pattern in the latest version of each SCCS file in a directory,
the following may be executed: 

get -p -n -s directory | grep pattern 

If both the -m and -n keyletters are specified, each line of the
generated g-file is preceded by the value of the %M% ID
keyword and a tab (this is the effect of the -n keyletter) and
followed by the line in the format produced by the -m
keyletter. Because use of the -m keyletter and/or the -n
keyletter causes the contents of the g-file to be modified, such a
g-file must not be used for creating a delta. Therefore, neither
the -m keyletter nor the -n keyletter may be specified
together with the -e keyletter.

See get(1) in the AT&T UNIX PC UNIX System V Manual for
a full description of additional get keyletters.
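
The line prefixes produced by the -m and -n keyletters of get, as described above, can be pictured with the following Python sketch. The function annotate and its input (a hypothetical list of pairs giving, for each line, the SID of the delta that inserted it and the text of the line) are illustrative only.

```python
def annotate(lines, show_sid=False, module=None):
    """Prefix each line as the -m keyletter (SID) and the -n keyletter (module name) do."""
    out = []
    for sid, text in lines:
        if show_sid:                 # -m: SID, tab, text
            text = sid + "\t" + text
        if module is not None:       # -n: %M% value, tab, then the rest of the line
            text = module + "\t" + text
        out.append(text)
    return out

lines = [("1.1", "main()"), ("1.3", "{")]
both = annotate(lines, show_sid=True, module="abc")
```

Because every line of the result differs from the stored text, output of this kind is unsuitable as the basis for a delta.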



B. The "delta" Command

The delta command is used to incorporate the changes made to 
a g-file into the corresponding SCCS file, i.e., to create a delta, 
and therefore, a new version of the file. 

Invocation of the delta command requires the existence of a p- 
file. The delta command examines the p-file to verify the 
presence of an entry containing the user's login name. If none 
is found, an error message results. The delta command 
performs the same permission checks that get performs when
invoked with the -e keyletter. If all checks are successful, delta
determines what has been changed in the g-file by comparing it
via diff(1) with its own temporary copy of the g-file as it was
before editing. This temporary copy of the g-file is called the 
d-file (its name is formed by replacing the "s." of the SCCS file 
name with "d.") and is obtained by performing an internal get 
at the SID specified in the p-file entry. 

The required p-file entry is the one containing the login name 
of the user executing delta because the user who retrieved the 
g-file must be the one who creates the delta. However, if the 
login name of the user appears in more than one entry, the 
same user has executed get with the -e keyletter more than
once on the same SCCS file. The -r keyletter must then be
used with delta to specify the SID that uniquely identifies the 
p-file entry. This entry is the one used to obtain the SID of the 
delta to be created. 
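
The selection of the p-file entry can be modeled as follows. This Python sketch is not the delta implementation: entries are hypothetical (retrieved SID, new SID, login name) tuples, and accepting either the retrieved SID or the new SID as the identifying value is an assumption of the sketch.

```python
def select_p_entry(entries, login, sid=None):
    """Return the p-file entry that a delta by `login` would use."""
    mine = [e for e in entries if e[2] == login]
    if not mine:
        raise LookupError("no p-file entry for " + login)
    if len(mine) == 1 and sid is None:
        return mine[0]
    # Several edits by the same user: the SID must single out one entry.
    matches = [e for e in mine if sid in (e[0], e[1])]
    if len(matches) != 1:
        raise LookupError("an SID that identifies exactly one entry is required")
    return matches[0]

entries = [("1.3", "1.4", "xyz"), ("1.2", "1.2.1.1", "xyz"), ("2.1", "2.2", "wql")]
entry = select_p_entry(entries, "xyz", sid="1.2")
```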

In practice, the most common invocation of delta is 

delta s.abc

which prompts on the standard output (but only if it is a
terminal) 

comments? 

to which the user replies with a description of why the delta is 
being made, terminating the reply with a newline character. 
The user's response may be up to 512 characters long with 
newlines (not intended to terminate the response) escaped by 
backslashes "\". 

If the SCCS file has a v flag, delta first prompts with

MRs? (Modification Requests) 

on the standard output. (Again, this prompt is printed only if 
the standard output is a terminal.) The standard input is then 
read for MR numbers, separated by blanks and/or tabs, 
terminated in the same manner as the response to the prompt 
"comments?". In a tightly controlled environment, it is 
expected that deltas are created only as a result of some trouble 
report, change request, trouble ticket, etc., collectively called
MRs. It is desirable (or necessary) to record such MR
number(s) within each delta. 

The -y and/or -m keyletters may be used to supply the
commentary (comments and MR numbers, respectively) on the
command line rather than through the standard input.

delta -y"descriptive comment" -m"mrnum1 mrnum2" s.abc

In this case, the corresponding prompts are not printed, and the
standard input is not read. The -m keyletter is allowed only if
the SCCS file has a v flag. These keyletters are useful when
delta is executed from within a shell procedure [see sh(1) in
the AT&T UNIX PC UNIX System V Manual].

The commentary (comments and/or MR numbers), whether
solicited by delta or supplied via keyletters, is recorded as part 
of the entry for the delta being created and applies to all SCCS 
files processed by the same invocation of delta. This implies 
that (if delta is invoked with more than one file argument and 
the first file named has a v flag) all files named must have this 
flag. Similarly, if the first file named does not have this flag, 
then none of the files named may have it. Any file that does 
not conform to these rules is not processed. 

When processing is complete, delta outputs (on the standard 
output) the SID of the created delta (obtained from the p-file 
entry) and the counts of lines inserted, deleted, and left 
unchanged by the delta. Thus, a typical output might be 

1.4 

14 inserted 

7 deleted 

345 unchanged 

It is possible that the counts of lines reported as inserted, 
deleted, or unchanged by delta do not agree with the user's 
perception of the changes applied to the g-file. The reason for 
this is that there usually are a number of ways to describe a set 
of such changes, especially if lines are moved around in the g- 
file, and delta is likely to find a description that differs from 
the user's perception. However, the total number of lines of the 
new delta (the number inserted plus the number left 
unchanged) should agree with the number of lines in the edited 
g-file. 
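
The relation between the three counts can be checked with a small Python sketch. Here the standard difflib module stands in for diff(1), so the individual counts may differ from those delta would report; only the identity "inserted plus unchanged equals the number of lines in the edited g-file" is being illustrated. The function name delta_counts is hypothetical.

```python
import difflib

def delta_counts(before, after):
    """Return (inserted, deleted, unchanged) line counts between two lists of lines."""
    inserted = deleted = unchanged = 0
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            unchanged += i2 - i1
        else:
            # "replace", "delete", and "insert" all reduce to lines removed
            # from the old text and lines added to the new text.
            deleted += i2 - i1
            inserted += j2 - j1
    return inserted, deleted, unchanged

before = ["a", "b", "c"]
after = ["a", "x", "c", "d"]
inserted, deleted, unchanged = delta_counts(before, after)
```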

If (in the process of making a delta) delta finds no ID 
keywords in the edited g-file, the message 

No id keywords (cm7)

is issued after the prompts for commentary but before any
other output. This indicates that any ID keywords that may
have existed in the SCCS file have been replaced by their values
or deleted during the editing process. This could be caused by
creating a delta from a g-file that was created by a get without
the -e keyletter (recall that ID keywords are replaced by get
in that case). This could also be caused by accidentally deleting 
or changing the ID keywords during the editing of the g-file. 
Another possibility is that the file had no ID keywords. In any 
case, it is left up to the user to determine what remedial action 
is necessary. However, the delta is made unless there is an i 
flag in the SCCS file indicating that this should be treated as a 
fatal error. In this last case, the delta is not created. 

After the processing of an SCCS file is complete, the 
corresponding p-file entry is removed from the p-file. All
updates to the p-file are made to a temporary copy, the q-file, 
whose use is similar to the use of the x-file which is described 
in the part "SCCS COMMAND CONVENTIONS". If there is 
only one entry in the p-file, then the p-file itself is removed. 

In addition, delta removes the edited g-file unless the -n
keyletter is specified. Thus: 

delta -n s.abc 

will keep the g-file upon completion of processing. 

The -s (silent) keyletter suppresses all output that is normally
directed to the standard output, other than the prompts
"comments?" and "MRs?". Thus, use of the -s keyletter
together with the -y keyletter (and possibly, the -m keyletter)
causes delta neither to read the standard input nor to write
the standard output.

The differences between the g-file and the d-file (see above)
constitute the delta and may be printed on the standard output
by using the -p keyletter. The format of this output is similar
to that produced by diff(1).



C. The "admin" Command

The admin command is used to administer SCCS files, that is, 
to create new SCCS files and to change parameters of existing 
ones. When an SCCS file is created, its parameters are 
initialized by use of keyletters or are assigned default values if 
no keyletters are supplied. The same keyletters are used to 
change the parameters of existing files. 

Two keyletters are supplied for use in conjunction with 
detecting and correcting "corrupted" SCCS files (see "Auditing" 
in part "SCCS FILES"). Newly created SCCS files are given 
mode 444 (read-only) and are owned by the effective user. Only 
a user with write permission in the directory containing the 
SCCS file may use the admin command upon that file. 



Creation of SCCS Files 

An SCCS file may be created by executing the command 

admin -ifirst s.abc 

in which the value "first" of the -i keyletter specifies the name
of a file from which the text of the initial delta of the SCCS file
s.abc is to be taken. Omission of the value of the -i keyletter
indicates that admin is to read the standard input for the text 
of the initial delta. Thus, the command 

admin -i s.abc < first 

is equivalent to the previous example. If the text of the initial 
delta does not contain ID keywords, the message

No id keywords (cm7)

is issued by admin as a warning. However, if the same
invocation of the command also sets the i flag (not to be
confused with the -i keyletter), the message is treated as an
error and the SCCS file is not created. Only one SCCS file may
be created at a time using the -i keyletter.

When an SCCS file is created, the release number assigned to 
its first delta is normally "1", and its level number is always 
"1". Thus, the first delta of an SCCS file is normally "1.1". 
The -r keyletter is used to specify the release number to be
assigned to the first delta. Thus: 



admin -ifirst -r3 s.abc 

indicates that the first delta should be named "3.1" rather than
"1.1". Because this keyletter is only meaningful in creating the
first delta, its use is only permitted with the -i keyletter.

Inserting Commentary for the Initial Delta 

When an SCCS file is created, the user may choose to supply 
commentary stating the reason for creation of the file. This is 
done by supplying comments (-y keyletter) and/or MR
numbers (-m keyletter) in exactly the same manner as for
delta. The creation of an SCCS file may sometimes be the
direct result of an MR. If comments (-y keyletter) are
omitted, a comment line of the form

date and time created YY/MM/DD HH:MM:SS by logname

is automatically generated. 

If it is desired to supply MR numbers (-m keyletter), the v
flag must also be set (using the -f keyletter described below).
The v flag simply determines whether or not MR numbers must
be supplied when using any SCCS command that modifies a
"delta commentary" [see sccsfile(4) in the AT&T UNIX PC
UNIX System V Manual] in the SCCS file. Thus:

admin -ifirst -mmrnum1 -fv s.abc

Note that the -y and -m keyletters are only effective if a new
SCCS file is being created.



Initialization and Modification of SCCS File Parameters 

The portion of the SCCS file reserved for descriptive text may 
be initialized or changed through the use of the -t keyletter.
The descriptive text is intended as a summary of the contents 
and purpose of the SCCS file. 



When an SCCS file is being created and the -t keyletter is
supplied, it must be followed by the name of a file from which 
the descriptive text is to be taken. For example, the command 

admin -ifirst -tdesc s.abc 

specifies that the descriptive text is to be taken from file desc.

When processing an existing SCCS file, the -t keyletter
specifies that the descriptive text (if any) currently in the file 
is to be replaced with the text in the named file. Thus: 

admin -tdesc s.abc 

specifies that the descriptive text of the SCCS file is to be 
replaced by the contents of desc; omission of the file name after
the -t keyletter as in

admin -t s.abc

causes the removal of the descriptive text from the SCCS file. 

The flags of an SCCS file may be initialized and changed, or
deleted, through the use of the -f and -d keyletters,
respectively. The flags of an SCCS file are used to direct
certain actions of the various commands. See admin(1) in the
AT&T UNIX PC UNIX System V Manual for a description of 
all the flags. For example, the i flag specifies that the warning 
message (stating that there are no ID keywords contained in 
the SCCS file) should be treated as an error. Also the d 
(default SID) flag specifies the default version of the SCCS file 
to be retrieved by the get command. The — f keyletter is used 
to set a flag and, possibly, to set its value. For example, 

admin -ifirst -fi -fmmodname s.abc 

sets the i flag and the m (module name) flag. The value 
"modname" specified for the m flag is the value that the get 
command will use to replace the %M% ID keyword. (In the 
absence of the m flag, the name of the g-file is used as the 
replacement for the %M% ID keyword.) Note that several — f 
keyletters may be supplied on a single invocation of admin and 
that — f keyletters may be supplied whether the command is 
creating a new SCCS file or processing an existing one. 

The — d keyletter is used to delete a flag from an SCCS file and 
may only be specified when processing an existing file. As an 
example, the command 

admin -dm s.abc 

removes the m flag from the SCCS file. Several — d keyletters 
may be supplied on a single invocation of admin and may be 
intermixed with — f keyletters. 
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The sees files contain a list (user list) of login names and/or 
group IDs of users who are allowed to create deltas. This list is 
empty by default which implies that anyone may create deltas. 
To add login names and/or group IDs to the list, the —a 
keyletter is used. For example, 

admin -axyz -awql -al234 s.abc 

adds the login names "xyz" and "wql" and the group ID "1234" 
to the list. The —a keyletter may be used whether admin is 
creating a new SeeS file or processing an existing one and may 
appear several times. The — e keyletter is used in an analogous 
manner if one wishes to remove (erase) login names or group 
IDs from the list. 



D. The "prs" Command 

The prs command is used to print on the standard output all or 
parts of an SCCS file in a format, called the output "data 
specification," supplied by the user via the -d keyletter. The 
data specification is a string consisting of SCCS file data 
keywords (not to be confused with get ID keywords) 
interspersed with optional user text. 

Data keywords are replaced by appropriate values according to 
their definitions. For example, 

:I: 

is defined as the data keyword that is replaced by the SID of a 
specified delta. Similarly, :F: is defined as the data keyword 
for the SCCS file name currently being processed, and :C: is 
defined as the comment line associated with a specified delta. 
All parts of an SCCS file have an associated data keyword. For 
a complete list of the data keywords, see prs(1) in the AT&T 
UNIX PC UNIX System V Manual. 

There is no limit to the number of times a data keyword may 
appear in a data specification. Thus, for example, 

prs -d":I: this is the top delta for :F: :I:" s.abc 

may produce on the standard output 

2.1 this is the top delta for s.abc 2.1 

Information may be obtained from a single delta by specifying 
the SID of that delta using the -r keyletter. For example, 

prs -d":F:: :I: comment line is: :C:" -r1.4 s.abc 

may produce the following output: 

s.abc: 1.4 comment line is: THIS IS A COMMENT 

If the -r keyletter is not specified, the value of the SID 
defaults to the most recently created delta. 
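
The substitution of data keywords can be imitated with a few lines of 
shell. The following is only a sketch: the function name expand_spec 
and its arguments are hypothetical, the values are passed in by hand 
rather than read from an SCCS file, and it is not how prs itself is 
implemented. 

```shell
#!/bin/sh
# expand_spec SPEC SID FILE
# Hypothetical helper: replaces every :I: in SPEC with SID and every
# :F: with FILE, imitating prs data keyword substitution.
# Assumes SID and FILE contain no characters special to sed (&, /, \).
expand_spec() {
    printf '%s\n' "$1" | sed -e "s/:I:/$2/g" -e "s/:F:/$3/g"
}

expand_spec ':I: this is the top delta for :F: :I:' 2.1 s.abc
# prints: 2.1 this is the top delta for s.abc 2.1
```

As in the prs example above, a keyword that appears twice in the 
specification is replaced twice. 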

In addition, information from a range of deltas may be 
obtained by specifying the -l or -e keyletters. The -e 
keyletter substitutes data keywords for the SID designated via 
the -r keyletter and all deltas created earlier. The -l keyletter 
substitutes data keywords for the SID designated via the -r 
keyletter and all deltas created later. Thus, the command 

prs -d:I: -r1.4 -e s.abc 

may output 

1.4 
1.3 
1.2.1.1 
1.2 
1.1 

and the command 

prs -d:I: -r1.4 -l s.abc 

may produce 

3.3 
3.2 
3.1 
2.2.1.1 
2.2 
2.1 
1.4 

Substitution of data keywords for all deltas of the SCCS file 
may be obtained by specifying both the -e and -l keyletters. 



E. The "help" Command 

The help command prints explanations of SCCS commands and 
of messages that these commands may print. Arguments to 
help, zero or more of which may be supplied, are simply the 
names of SCCS commands or the code numbers that appear in 
parentheses after SCCS messages. If no argument is given, 
help prompts for one. The help command has no concept of 
keyletter arguments or file arguments. Explanatory 
information related to an argument, if it exists, is printed on 
the standard output. If no information is found, an error 
message is printed. Note that each argument is processed 
independently, and an error resulting from one argument will 
not terminate the processing of the other arguments. 

Explanatory information related to a command is a synopsis of 
the command. For example, 

help ge5 rmdel 

produces 

ge5: 
"nonexistent sid" 
The specified sid does not exist in the 
given file. 
Check for typos. 

rmdel: 
     rmdel -rSID name ... 


F. The "rmdel" Command 

The rmdel command is provided to allow removal of a delta 
from an SCCS file. Its use should be reserved for those cases in 
which incorrect global changes were made a part of the delta to 
be removed. 

The delta to be removed must be a "leaf" delta. That is, it 
must be the latest (most recently created) delta on its branch 
or on the trunk of the SCCS file tree. In Figure 14-3, only 
deltas 1.3.1.2, 1.3.2.2, and 2.2 can be removed; once they are 
removed, then deltas 1.3.2.1 and 2.1 can be removed, etc. 

To be allowed to remove a delta, the effective user must have 
write permission in the directory containing the SCCS file. In 
addition, the real user must either be the one who created the 
delta being removed or be the owner of the SCCS file and its 
directory. 

The -r keyletter, which is mandatory, is used to specify the 
complete SID of the delta to be removed (i.e., it must have two 
components for a trunk delta and four components for a branch 
delta). Thus: 

rmdel -r2.3 s.abc 

specifies the removal of (trunk) delta "2.3" of the SCCS file. 
Before removal of the delta, rmdel checks that the release 
number (R) of the given SID satisfies the relation: 

floor <= R <= ceiling 

The rmdel command also checks that the SID specified is not 
that of a version for which a get for editing has been executed 
and whose associated delta has not yet been made. In 
addition, the login name or group ID of the user must appear in 
the file's "user list", or the "user list" must be empty. Also, 
the release specified cannot be locked against editing. That is, 
if the l flag is set [see admin(1) in the AT&T UNIX PC UNIX 
System V Manual], the release specified must not be contained 
in the list. If these conditions are not satisfied, processing is 
terminated, and the delta is not removed. After the specified 
delta has been removed, its type indicator in the "delta table" 
of the SCCS file is changed from "D" ("delta") to "R" 
("removed"). 
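
The release check described above amounts to a simple range test on 
the first component of the SID. A minimal sketch follows; the function 
name release_allowed is hypothetical, and the floor and ceiling are 
passed as arguments here, whereas rmdel takes them from the flags of 
the SCCS file. 

```shell
#!/bin/sh
# release_allowed SID FLOOR CEILING
# Hypothetical helper: succeeds when the release number R (the first
# component of SID) satisfies FLOOR <= R <= CEILING.
release_allowed() {
    release=${1%%.*}            # "2.3" -> "2", "1.3.2.2" -> "1"
    [ "$release" -ge "$2" ] && [ "$release" -le "$3" ]
}

if release_allowed 2.3 1 4; then
    echo "release check passed"
else
    echo "release check failed"
fi
# prints: release check passed
```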



G. The "cdc" Command 

The cdc command is used to change a delta's commentary that 
was supplied when that delta was created. Its invocation is 
analogous to that of the rmdel command, except that the delta 
to be processed is not required to be a leaf delta. For example, 

cdc -r3.4 s.abc 

specifies that the commentary of delta "3.4" of the SCCS file is 
to be changed. 

The new commentary is solicited by cdc in the same manner as 
that of delta. The old commentary associated with the 
specified delta is kept, but it is preceded by a comment line 
indicating that it has been changed (i.e., superseded), and the 
new commentary is entered ahead of this comment line. The 
"inserted" comment line records the login name of the user 
executing cdc and the time of its execution. 

The cdc command also allows for the deletion of selected MR 
numbers associated with the specified delta. This is specified 
by preceding the selected MR numbers by the character "!". 
Thus: 

cdc -r1.4 s.abc 
MRs? mrnum3 !mrnum1 
comments? deleted wrong MR number and inserted 
correct MR number 

inserts "mrnum3" and deletes "mrnum1" for delta 1.4. 



H. The "what" Command 

The what command is used to find identifying information 
within any UNIX system file whose name is given as an 
argument to what. Directory names and a name of "-" (a lone 
minus sign) are not treated specially as they are by other SCCS 
commands, and no keyletters are accepted by the command. 

The what command searches the given file(s) for all 
occurrences of the string "@(#)", which is the replacement for 
the %Z% ID keyword [see get(1)], and prints (on the standard 
output) the balance following that string until the first double 
quote ("), greater than (>), backslash (\), newline, or 
(nonprinting) NUL character. For example, if the SCCS file 
s.prog.c (a C language program) contains the following line: 

char id[] = "%W%"; 

and the command 

get -r3.4 s.prog.c 

is executed and the resulting g-file is compiled to produce 
"prog.o" and "a.out", then the command 

what prog.c prog.o a.out 

produces 

prog.c: 
     prog.c:3.4 
prog.o: 
     prog.c:3.4 
a.out: 
     prog.c:3.4 

The string searched for by what need not be inserted via an ID 
keyword of get; it may be inserted in any convenient manner. 
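
The search rule can be illustrated for ordinary text files with a 
short awk script. This is a simplified sketch only: the function name 
what_sketch is hypothetical, and unlike the real command it does not 
stop at NUL characters and is not meant for object files. 

```shell
#!/bin/sh
# what_sketch FILE...
# Hypothetical, simplified imitation of what(1) for text files: for each
# file, print its name, then the text following every "@(#)" up to the
# first double quote, greater-than sign, backslash, or end of line.
what_sketch() {
    for f in "$@"; do
        printf '%s:\n' "$f"
        awk '{
            line = $0
            # Handle every occurrence on the line, left to right.
            while ((i = index(line, "@(#)")) > 0) {
                line = substr(line, i + 4)   # text after the marker
                id = line
                sub(/[">\\].*/, "", id)      # cut at first terminator
                printf "\t%s\n", id
            }
        }' "$f"
    done
}

# Usage on a scratch file (the file name is arbitrary):
tmp=${TMPDIR:-/tmp}/what_demo.$$
printf 'char id[] = "@(#)prog.c:3.4";\n' > "$tmp"
what_sketch "$tmp"
rm -f "$tmp"
```

The call prints the name of the scratch file followed by an indented 
line reading prog.c:3.4. 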



I. The "sccsdiff" Command 

The sccsdiff command determines (and prints on the standard 
output) the differences between two specified versions of one or 
more SCCS files. The versions to be compared are specified by 
using the -r keyletter, whose format is the same as for the get 
command. The two versions must be specified as the first two 
arguments to this command in the order they were created, i.e., 
the older version is specified first. Any following keyletters are 
interpreted as arguments to the pr(1) command (which actually 
prints the differences) and must appear before any file names. 
The SCCS files to be processed are named last. Directory 
names and a name of "-" (a lone minus sign) are not acceptable 
to sccsdiff. 

The differences are printed in the form generated by diff(1). 
The following is an example of the invocation of sccsdiff: 

sccsdiff -r3.4 -r5.6 s.abc 



J. The "comb" Command 

The comb command generates a "shell procedure" [see sh(1) in 
the AT&T UNIX PC UNIX System V Manual] which attempts 
to reconstruct the named SCCS files so that the reconstructed 
files are smaller than the originals. The generated shell 
procedure is written on the standard output. Named SCCS files 
are reconstructed by discarding unwanted deltas and combining 
other specified deltas. The SCCS files that contain deltas no 
longer useful should be discarded. It is not recommended that 
comb be used as a matter of routine; its use should be 
restricted to a very small number of times in the life of an 
SCCS file. 

In the absence of any keyletters, comb preserves only leaf 
deltas and the minimum number of ancestor deltas necessary to 
preserve the "shape" of the SCCS file tree. The effect of this is 
to eliminate middle deltas on the trunk and on all branches of 
the tree. Thus, in Figure 14-3, deltas 1.2, 1.3.2.1, 1.4, and 2.1 
would be eliminated. Some of the keyletters are summarized as 
follows: 

The -p keyletter specifies the oldest delta that is to be 
preserved in the reconstruction. All older deltas are 
discarded. 

The -c keyletter specifies a list [see get(1) in the AT&T 
UNIX PC UNIX System V Manual for the syntax of such a 
list] of deltas to be preserved. All other deltas are 
discarded. 

The -s keyletter causes the generation of a shell 
procedure which, when run, produces only a report 
summarizing the percentage space (if any) to be saved by 
reconstructing each named SCCS file. It is recommended 
that comb be run with this keyletter (in addition to any 
others desired) before any actual reconstructions. 

It should be noted that the shell procedure generated by comb 
is not guaranteed to save space. In fact, it is possible for the 
reconstructed file to be larger than the original. Note, too, that 
the shape of the SCCS file tree may be altered by the 
reconstruction process. 



K. The "val" Command 

The val command is used to determine if a file is an SCCS file 
meeting the characteristics specified by an optional list of 
keyletter arguments. Any characteristics not met are 
considered errors. 

The val command checks for the existence of a particular delta 
when the SID for that delta is explicitly specified via the -r 
keyletter. The string following the -y or -m keyletter is used 
to check the value set by the t or m flag, respectively [see 
admin(1) in the AT&T UNIX PC UNIX System V Manual for 
a description of the flags]. 

The val command treats the special argument "-" differently 
from other SCCS commands. This argument allows val to read 
the argument list from the standard input as opposed to 
obtaining it from the command line. The standard input is 
read until end of file. This capability allows for one invocation 
of val with different values for the keyletter and file 
arguments. For example, 

val - 
-yc -mabc s.abc 
-mxyz -ypl1 s.xyz 

first checks if file s.abc has a value "c" for its "type" flag and 
value "abc" for the "module name" flag. Once processing of the 
first file is completed, val then processes the remaining files, 
in this case, s.xyz, to determine if they meet the characteristics 
specified by the keyletter arguments associated with them. 

The val command returns an 8-bit code; each bit set indicates 
the occurrence of a specific error [see val(1) for a description 
of possible errors and the codes]. In addition, an appropriate 
diagnostic is printed unless suppressed by the -s keyletter. A 
return code of "0" indicates all named files met the 
characteristics specified. 
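
Because each bit of the return code stands for a separate error, a 
calling shell procedure may want to take the code apart. The sketch 
below lists the bits that are set; the function name set_bits is 
hypothetical, and the meaning of each bit must still be looked up in 
val(1). 

```shell
#!/bin/sh
# set_bits CODE
# Hypothetical helper: prints, in hexadecimal, each bit that is set in
# the 8-bit CODE, highest bit first.
set_bits() {
    code=$1
    mask=128
    while [ "$mask" -ge 1 ]; do
        if [ $((code & mask)) -ne 0 ]; then
            printf '0x%02x\n' "$mask"
        fi
        mask=$((mask / 2))
    done
}

set_bits 36
# prints:
# 0x20
# 0x04
```

In practice the argument would be the exit status of val, for 
instance by running val on a file and then calling set_bits $?. 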



SCCS FILES 

This part discusses several topics that must be considered 
before extensive use is made of SCCS. These topics deal with 
the protection mechanisms relied upon by SCCS, the format of 
SCCS files, and the recommended procedures for auditing 
SCCS files. 



A. Protection 

SCCS relies on the capabilities of the UNIX software for 
most of the protection mechanisms required to prevent 
unauthorized changes to SCCS files (i.e., changes made by 
non-SCCS commands). The only protection features provided 
directly by SCCS are the "release lock" flag, the "release floor" 
and "ceiling" flags, and the "user list". 

New SCCS files created by the admin command are given 
mode 444 (read-only). It is recommended that this mode 
remain unchanged as it prevents any direct modification of the 
files by non-SCCS commands. It is further recommended that 
the directories containing SCCS files be given mode 755, which 
allows only the owner of the directory to modify its contents. 

SCCS files should be kept in directories that contain only 
SCCS files and any temporary files created by SCCS commands. 
This simplifies protection and auditing of SCCS files. The 
contents of directories should correspond to convenient logical 
groupings, e.g., subsystems of a large project. 

SCCS files must have only one link (name) because the 
commands that modify SCCS files do so by creating a copy of 
the file (the x-file, see "SCCS COMMAND CONVENTIONS") 
and, upon completion of processing, removing the old file and 
renaming the x-file. If the old file had more than one link, this 
would break such additional links. Rather than process such 
files, SCCS commands produce an error message. All SCCS files 
must have names that begin with "s.". 

When only one user uses SCCS, the real and effective user IDs 
are the same, and the user ID owns the directories containing 
SCCS files. Therefore, SCCS may be used directly without any 
preliminary preparation. 

However, in those situations in which several users with unique 
user IDs are assigned responsibility for one SCCS file (e.g., in 
large software development projects), one user (equivalently, 
one user ID) must be chosen as the "owner" of the SCCS files 
and be the one who will "administer" them (e.g., by using the 
admin command). This user is termed the "SCCS 
administrator" for that project. Because other users of SCCS 
do not have the same privileges and permissions as the SCCS 
administrator, they are not able to execute directly those 
commands that require write permission in the directory 
containing the SCCS files. Therefore, a project-dependent 
program is required to provide an interface to the get, delta, 
and if desired, rmdel and cdc commands. 

The interface program must be owned by the SCCS 
administrator and must have the "set user ID on execution" bit 
"on" [see chmod(1) in the AT&T UNIX PC UNIX System V 
Manual]. This assures that the effective user ID is the user ID 
of the administrator. This program invokes the desired SCCS 
command and causes it to inherit the privileges of the interface 
program for the duration of that command's execution. Thus, 
the owner of an SCCS file can modify it at will. Other users 
whose login names or group IDs are in the "user list" for that 
file (but are not the owner) are given the necessary permissions 
only for the duration of the execution of the interface program. 
Other users are thus able to modify the SCCS files only 
through the use of delta and, possibly, rmdel and cdc. The 
project-dependent interface program, as its name implies, must 
be custom-built for each project. 



B. Formatting 

The SCCS files are composed of lines of ASCII text arranged in 
six parts as follows: 

Checksum          A line containing the "logical" sum of all 
                  the characters of the file (not including 
                  this checksum itself). 

Delta Table       Information about each delta, such as 
                  type, SID, date and time of creation, and 
                  commentary. 

User Names        List of login names and/or group IDs of 
                  users who are allowed to modify the file 
                  by adding or removing deltas. 

Flags             Indicators that control certain actions of 
                  various SCCS commands. 

Descriptive Text  Arbitrary text provided by the user; 
                  usually a summary of the contents and 
                  purpose of the file. 

Body              Actual text that is being administered by 
                  SCCS, intermixed with internal SCCS 
                  control lines. 

Detailed information about the contents of the various sections 
of the file may be found in sccsfile(4). The checksum is the 
only portion of the file that is of interest below. 

It is important to note that because SCCS files are ASCII files 
they may be processed by various UNIX software commands, 
such as ed(1), grep(1), and cat(1). This is very convenient in 
those instances in which an SCCS file must be modified 
manually (e.g., when the time and date of a delta was recorded 
incorrectly because the system clock was set incorrectly) or 
when it is desired to simply look at the file. 

Caution: Extreme care should be exercised when 
modifying SCCS files with non-SCCS commands. 



C. Auditing 

On rare occasions, perhaps due to an operating system or 
hardware malfunction, an SCCS file or portions of it (i.e., one 
or more "blocks") can be destroyed. The SCCS commands (like 
most UNIX software commands) issue an error message when a 
file does not exist. In addition, SCCS commands use the 
checksum stored in the SCCS file to determine whether a file 
has been corrupted since it was last accessed [possibly by 
having lost one or more blocks or by having been modified with 
ed(1)]. No SCCS command will process a corrupted SCCS file 
except the admin command with the -h or -z keyletters, as 
described below. 

It is recommended that SCCS files be audited for possible 
corruptions on a regular basis. The simplest and fastest way to 
perform an audit is to execute the admin command with the 
-h keyletter on all SCCS files: 

admin -h s.file1 s.file2 ... 

or 

admin -h directory1 directory2 ... 

If the new checksum of any file is not equal to the checksum in 
the first line of that file, the message 

corrupted file (co6) 

is produced for that file. This process continues until all the 
files have been examined. When examining directories (as in 
the second example above), the process just described will not 
detect missing files. A simple way to detect whether any files 
are missing from a directory is to periodically execute the ls(1) 
command on that directory and compare the outputs of the 
most current and the previous executions. Any file whose name 
appears in the previous output but not in the current one has 
been removed by some means. 
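
The comparison of two ls listings can be automated with sort and 
comm. The following is a sketch under stated assumptions: the function 
name missing_since is hypothetical, and the earlier listing is assumed 
to have been saved in sorted form. 

```shell
#!/bin/sh
# missing_since PREVIOUS_LIST DIRECTORY
# Hypothetical helper: PREVIOUS_LIST is a file holding the sorted output
# of an earlier "ls DIRECTORY". Prints the names that were present then
# but are absent from DIRECTORY now.
missing_since() {
    current=${TMPDIR:-/tmp}/sccs_audit.$$
    ls "$2" | sort > "$current"
    comm -23 "$1" "$current"     # lines found only in the old listing
    rm -f "$current"
}

# First audit (directory and file names are placeholders):
#     ls sccsdir | sort > sccsdir.list
# Later audits:
#     missing_since sccsdir.list sccsdir
```

A name reported by the later audit is one that appeared in the saved 
listing but is no longer in the directory. 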

Whenever a file has been corrupted, the manner in which the 
file is restored depends upon the extent of the corruption. If 
damage is extensive, the best solution is to contact the local 
UNIX system operations group and request that the file be 
restored from a backup copy. In the case of minor damage, 
repair through use of the editor ed(1) may be possible. In the 
latter case after such repair, the following command must be 
executed: 

admin -z s.file 

The purpose of this is to recompute the checksum to bring it 
into agreement with the actual contents of the file. After this 
command is executed on a file, any corruption that existed in 
the file will no longer be detectable. 



AN SCCS INTERFACE PROGRAM 



A. General 

In order to permit UNIX system users [with different user 
identification numbers (user IDs)] to use SCCS commands upon 
the same files, an SCCS interface program is provided. It 
temporarily grants the necessary file access permissions to 
these users. This part discusses the creation and use of such an 
interface program. The SCCS interface program may also be 
used as a preprocessor to SCCS commands since it can perform 
operations upon its arguments. 



B. Function 

When only one user uses SCCS, the real and effective user IDs 
are the same; and that user's ID owns the directories 
containing SCCS files. However, there are situations (e.g., in 
large software development projects) in which it is practical to 
allow more than one user to make changes to the same set of 
SCCS files. In these cases, one user must be chosen as the 
"owner" of the SCCS files and be the one who will "administer" 
them (e.g., by using the admin command). This user is termed 
the "SCCS administrator" for that project. Since other users of 
SCCS do not have the same privileges and permissions as the 
SCCS administrator, the other users are not able to execute 
directly those commands that require write permission in the 
directory containing the SCCS files. Therefore, a project- 
dependent program is required to provide an interface to the 
get, delta, and if desired, rmdel, cdc, and unget commands. 
Other SCCS commands either do not require write permission 
in the directory containing SCCS files or are (generally) 
reserved for use only by the administrator. 

The interface program 

• Must be owned by the SCCS administrator 

• Must be executable by nonowners 

• Must have the "set user ID on execution" bit "on" [see 
chmod(1) in the AT&T UNIX PC UNIX System V 
Manual]. 

Then when executed, the effective user ID is the user ID of the 
administrator. This program's function is to invoke the desired 
SCCS command and to cause it to inherit the privileges of the 
SCCS administrator for the duration of that command's 
execution. In this manner, the owner of an SCCS file (the 
administrator) can modify it at will. Other users whose login 
names are in the user list for that file (but who are not its 
owners) are given the necessary permissions only for the 
duration of the execution of the interface program. They are 
thus able to modify the SCCS files only through the use of 
delta and, possibly, rmdel and cdc. 

C. Basic Program 

When a UNIX system program is executed, it is passed, as 
argument 0, the name by which it was invoked, followed by 
any additional user-supplied arguments. Thus, if a program 
is given a number of links (names), the program may alter its 
processing depending upon which link invokes the program. 
This mechanism is used by an SCCS interface program to 
determine the SCCS command it should subsequently invoke 
[see exec(2) in the AT&T UNIX PC UNIX System V Manual]. 
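
The selection of a command from the invoking name can be sketched in 
shell, where $0 plays the role of argument 0. The function name 
command_for is hypothetical, and /usr/bin is assumed to hold the real 
SCCS commands. The sketch illustrates only the dispatch on the name; 
it does not grant any permissions and is not a substitute for the C 
interface program. 

```shell
#!/bin/sh
# command_for NAME
# Hypothetical helper: maps the name by which the interface program was
# invoked to the full path of the SCCS command to run. The /usr/bin
# location is an assumption about where the real commands live.
command_for() {
    name=$(basename "$1")
    case $name in
        get|delta|rmdel|cdc|unget) echo "/usr/bin/$name" ;;
        *) echo "unknown command: $name" >&2; return 1 ;;
    esac
}

command_for ./delta
# prints: /usr/bin/delta
```

A script using this helper would call command_for "$0" and then 
execute the resulting path with its remaining arguments. 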

A generic interface program (inter.c, written in C language) is 
shown in Figure 14-5. Note the reference to the (unsupplied) 
function "filearg". This is intended to demonstrate that the 
interface program may also be used as a preprocessor to SCCS 
commands. For example, function "filearg" could be used to 
modify file arguments to be passed to the SCCS command by 
supplying the full pathname of a file, thus avoiding extraneous 
typing by the user. Also, the program could supply any 
additional (default) keyletter arguments desired. 



D. Linking and Use 

In general, the following demonstrates the steps to be 
performed by the SCCS administrator to create the SCCS 
interface program. It is assumed, for the purposes of the 
discussion, that the interface program inter.c resides in 
directory "/x1/xyz/sccs". Thus, the command sequence 

cd /x1/xyz/sccs 
cc ... inter.c -o inter ... 

compiles inter.c to produce the executable module inter (the 
"..." represents other arguments that may be required). The 
proper mode and the "set user ID on execution" bit are set by 
executing 

chmod 4755 inter 

Links are then created by, for example, 

ln inter get 
ln inter delta 
ln inter rmdel 

The names of the links may be arbitrary if the interface 
program is able to determine from them the names of SCCS 
commands to be invoked. Subsequently, any user whose shell 
parameter PATH [see sh(1) in the AT&T UNIX PC UNIX 
System V Manual] specifies directory "/x1/xyz/sccs" as the one 
to be searched first for executable commands may execute 

get -e /x1/xyz/sccs/s.abc 

from any directory to invoke the interface program (via its link 
"get"). The interface program then executes "/usr/bin/get" 
(the actual SCCS get command) upon the named file. As 
previously mentioned, the interface program could be used to 
supply the pathname "/x1/xyz/sccs" so that the user would 
only have to specify 

get -e s.abc 

to achieve the same results. 



Chapter 15 
THE "m4" MACRO PROCESSOR 

GENERAL 

The m4 macro processor is a front end for rational Fortran 
(Ratfor) and the C programming languages. The "#define" 
statement in C language and the analogous "define" in Ratfor 
are examples of the basic facility provided by any macro 
processor. 

At the beginning of a program, a symbolic name or symbolic 
constant can be defined as a particular string of characters. 
The compiler will then replace later unquoted occurrences of 
the symbolic name with the corresponding string. Besides the 
straightforward replacement of one string of text by another, 
the m4 macro processor provides the following features: 

• arguments 

• arithmetic capabilities 

• file manipulation 

• conditional macro expansion 

• string and substring functions. 

The basic operation of m4 is to read every alphanumeric token 
(string of letters and digits) input and determine if the token is 
the name of a macro. The name of the macro is replaced by its 
defining text, and the resulting string is pushed back onto the 
input to be rescanned. Macros may be called with arguments. 
The arguments are collected and substituted into the right 
places in the defining text before the defining text is rescanned. 

The user also has the capability to define new macros. Built-ins 
and user-defined macros work exactly the same way except that 
some of the built-in macros have side effects on the state of the 
process. The 31 built-in macros provided by the m4 macro 
processor are listed in Figure 15-1. 



Macro Name      Function 

changequote     Restores original characters or makes new quote 
                characters the left and right brackets. 

changecom       Changes left and right comment markers from 
                the default # and new line. 

decr            Returns the value of its argument decremented 
                by 1. 

define          Defines new macros. 

defn            Returns the quoted definition of its 
                argument(s). 

divert          Diverts output to 1-out-of-10 diversions. 

divnum          Returns the number of the currently active 
                diversion. 

dnl             Reads and discards characters up to and 
                including the next new line. 

dumpdef         Dumps the current names and definitions of 
                items named as arguments. 

errprint        Prints its arguments on the standard error 
                file. 

eval            Performs arbitrary arithmetic on integers. 

ifdef           Determines if a macro is currently defined. 

ifelse          Performs arbitrary conditional testing. 

include         Returns the contents of the file named in the 
                argument. A fatal error occurs if the file 
                name cannot be accessed. 

incr            Returns the value of its argument incremented 
                by 1. 

index           Returns the position where the second argument 
                begins in the first argument of index. 

len             Returns the number of characters that make up 
                its argument. 

m4exit          Causes immediate exit from m4. 

m4wrap          Pushes the exit code back at final EOF. 

maketemp        Facilitates making unique file names. 

popdef          Removes current definition of its argument(s) 
                exposing any previous definitions. 

pushdef         Defines new macros but saves any previous 
                definition. 

shift           Returns all arguments of shift except the 
                first argument. 

sinclude        Returns the contents of the file named in the 
                arguments. The macro remains silent and 
                continues if the file is inaccessible. 

substr          Produces substrings of strings. 

syscmd          Executes the UNIX System command given in 
                the first argument. 

traceoff        Turns macro trace off. 

traceon         Turns the macro trace on. 

translit        Performs character transliteration. 

undefine        Removes user-defined or built-in macro 
                definitions. 

undivert        Discards the diverted text. 

Figure 15-1. Built-in Macros 



To use the m4 macro processor, input the following command: 

m4 [optional files] 

Each argument file is processed in order. If there are no 
arguments or if an argument is "-", the standard input is read 
at that point. The processed text is written on the standard 
output, which may be captured for subsequent processing with 
the following input: 

m4 [files] >outputfile 



DEFINING MACROS 

The primary built-in function of m4 is define. Define is used 
to define new macros. The following input: 



define(name, stuff) 



causes the string name to be defined as stuff. All subsequent 
occurrences of name will be replaced by stuff. Name must be
alphanumeric and must begin with a letter (the underscore 
counts as a letter). Stuff is any text that contains balanced 
parentheses. Use of a backslash may stretch stuff over multiple 
lines. Thus, as a typical example, 

define(N, 100) 
if (i > N) 



defines N to be 100 and uses the symbolic constant N in a later
if statement.

The left parenthesis must immediately follow the word define
to signal that define has arguments. If a user-defined macro 
or built-in name is not followed immediately by "(", it is 
assumed to have no arguments. Macro calls have the following 
general form: 

name(arg1,arg2,...,argn)

A macro name is only recognized as such if it appears 
surrounded by nonalphanumerics. Using the following example: 

define(N, 100) 
if (NNN > 100) 



the variable NNN is absolutely unrelated to the defined macro
N even though the variable contains a lot of Ns.

Macros may be defined in terms of other names. For example, 

define(N, 100) 
define(M, N) 



defines both M and N to be 100. If N is subsequently
redefined, M retains the value 100; it does not track N.

The m4 macro processor expands macro names into their 
defining text as soon as possible. The string N is immediately 
replaced by 100. Then the string M is also immediately 
replaced by 100. The overall result is the same as using the 
following input in the first place: 

define(M, 100)

The order of the definitions can be interchanged as follows:



define(M, N) 
define(N, 100) 



Now M is defined to be the string N, so when the value of M is
requested later, the result is the value of N at that time
(because the M will be replaced by N, which will be replaced by
100).

The more general solution is to delay the expansion of the
arguments of define by quoting them. Any text surrounded by
left and right single quotes (` and ') is not expanded
immediately but has the quotes stripped off. The value of a
quoted string is the string stripped of the quotes. If the input is

define(N, 100)
define(M, `N')



the quotes around the N are stripped off as the argument is
being collected. The result of using quotes is to define M as
the string N, not 100. The general rule is that m4 always
strips off one level of single quotes whenever it evaluates
something. This is true even outside of macros. If the word
define is to appear in the output, the word must be quoted in
the input as follows:

`define' = 1;

Another example of using quotes is redefining N. To redefine
N, the evaluation must be delayed by quoting:

define(N, 100)
define(`N', 200)



In m4, it is often wise to quote the first argument of a macro. 
The following example will not redefine N: 



define(N, 100) 
define(N, 200) 



The N in the second definition is replaced by 100. The result is 
equivalent to the following statement: 

define(100, 200) 



This statement is ignored by m4 since only things that look 
like names can be defined. 
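
The difference between immediate and delayed expansion can be
checked from the shell. The following is a minimal sketch, not
part of the original guide. It assumes an m4 on the search path
whose quote characters are still the defaults (` and '); the
names demo, M1, and M2 are illustrative only.

```shell
# Assumption: "m4" is installed and uses the default quote characters.
demo() {
m4 <<'EOF'
define(N, 100)dnl
define(M1, N)dnl
define(M2, `N')dnl
define(`N', 200)dnl
M1 M2
EOF
}
# M1 captured the value 100 when it was defined; M2 holds the
# string N, so it follows the later redefinition of N.
out=$(demo)
echo "$out"
```

M1 was fixed at definition time, while M2 is looked up each time
it is used.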

If left and right single quotes are not convenient for some 
reason, the quote characters can be changed with the following 
built-in macro: 



changequote([, ]) 



The built-in changequote makes the new quote characters the 
left and right brackets. The original characters can be restored 
by using changequote without arguments as follows: 

changequote

There are two additional built-ins related to define. The
undefine macro removes the definition of some macro or
built-in as follows:



undefine(`N')



The macro removes the definition of N. Built-ins can be 
removed with undefine, as follows: 



undefine(`define')



But once removed, the definition cannot be reused. 

The built-in ifdef provides a way to determine if a macro is 
currently defined. Depending on the system, a definition 
appropriate for the particular machine can be made as follows: 

ifdef(`pdp11', `define(wordsize,16)')
ifdef(`u3b', `define(wordsize,32)')



Remember to use the quotes.

The ifdef macro actually permits three arguments. If the first
argument is defined, the value of ifdef is the second argument.
If the first argument is not defined, the value of ifdef is the
third argument. If there is no third argument, the value of
ifdef is null. For example,

ifdef(`unix', on UNIX, not on UNIX)
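
As a rough illustration of the two-argument and three-argument
forms, the sketch below defines one machine name by hand and
then tests for it. It assumes an m4 on the search path with the
default quote characters; the names demo, u3b, pdp11, and
unix_pc are chosen for the example and are not predefined by
m4.

```shell
# Assumption: "m4" is installed; u3b is defined here by hand.
demo() {
m4 <<'EOF'
define(`u3b')dnl
ifdef(`pdp11', `define(`wordsize', 16)')dnl
ifdef(`u3b', `define(`wordsize', 32)')dnl
ifdef(`unix_pc', `on UNIX PC', `not on UNIX PC'): `wordsize' is wordsize
EOF
}
# Only the u3b branch fires, and the third argument is used for
# the undefined name unix_pc.
out=$(demo)
echo "$out"
```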

ARGUMENTS

So far only the simplest form of macro processing has been
discussed: replacing one string by another (fixed) string.
User-defined macros may also have arguments, so different
invocations can have different results. Within the replacement
text for a macro (the second argument of its define), any
occurrence of $n is replaced by the nth argument when the
macro is actually used. Thus, the macro bump defined as

define(bump, $1 = $1 + 1) 



generates code to increment its argument by 1. The bump(x)
statement is equivalent to "x = x + 1".

A macro can have as many arguments as needed, but only the 
first nine are accessible ($1 through $9). The macro name is 
$0 although that is less commonly used. Arguments that are 
not supplied are replaced by null strings, so a macro can be 
defined which simply concatenates its arguments like this: 

define(cat, $1$2$3$4$5$6$7$8$9) 



Thus, cat(x, y, z) is equivalent to xyz. Arguments $4 through
$9 are null since no corresponding arguments were provided.
Leading unquoted blanks, tabs, or newlines that occur during
argument collection are discarded. All other white space is
retained. Thus:

define(a, b c)



defines a to be "b c".

Arguments are separated by commas; however, when commas
are within parentheses, the argument is not terminated nor 
separated. For example, 

define(a, (b,c)) 



has only two arguments. The first argument is a. The second 
is literally (b,c). A bare comma or parenthesis can be inserted 
by quoting it. 
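
The argument rules above can be tried with a short sketch such
as the following. It assumes an m4 on the search path with the
default quote characters; the shell function name demo is
illustrative, and the definitions are quoted here as a
precaution.

```shell
# Assumption: "m4" is installed and uses the default quote characters.
demo() {
m4 <<'EOF'
define(`bump', `$1 = $1 + 1')dnl
define(`cat', `$1$2$3$4$5$6$7$8$9')dnl
bump(x);
cat(x, y, z)
cat(`a,b', (c,d))
EOF
}
# Missing arguments are null; a quoted or parenthesized comma
# does not separate arguments.
out=$(demo)
echo "$out"
```

In the last call, cat receives only two arguments: the quoted
a,b and the parenthesized (c,d).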



ARITHMETIC BUILT-INS 

The m4 macro processor provides three built-in functions for
doing arithmetic on integers (only). The simplest is incr, which
increments its numeric argument by 1. The built-in decr
decrements by 1. Thus, to handle the common programming
situation where a variable is to be defined as "one more than
N", use the following:

define(N, 100)
define(N1, `incr(N)')



Then N1 is defined as one more than the current value of N.

The more general mechanism for arithmetic is a built-in called
eval, which is capable of arbitrary arithmetic on integers. The
operators in decreasing order of precedence are

unary + and -

** or ^ (exponentiation)

* / % (modulus)

+ -

== != < <= > >=

! (not)

& or && (logical and)

| or || (logical or).

Parentheses may be used to group operations where needed. 
All the operands of an expression given to eval must 
ultimately be numeric. The numeric value of a true relation 
(like 1>0) is 1 and false is 0. The precision in eval is 32 bits 
under the UNIX operating system. 

As a simple example, define M to be "2==N+1" using eval as 
follows: 



define(N, 3) 

define(M, `eval(2==N+1)')



The defining text for a macro should be quoted unless the text 
is very simple. Quoting the defining text usually gives the 
desired result and is a good habit to get into. 
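
The effect of quoting the defining text shows up when N is
changed later. The sketch below is one way to observe it. It
assumes an m4 on the search path with the default quote
characters; the shell function name demo is illustrative.

```shell
# Assumption: "m4" is installed and uses the default quote characters.
demo() {
m4 <<'EOF'
define(`N', 100)dnl
define(`N1', `incr(N)')dnl
define(`M', `eval(2 == N + 1)')dnl
N1 decr(N) M eval(2 ** 10) eval((7 + 3) * 4 % 6)
define(`N', 1)dnl
N1 M
EOF
}
# Because the bodies of N1 and M are quoted, both are evaluated
# again with the new value of N on the last line.
out=$(demo)
echo "$out"
```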



FILE MANIPULATION 

A new file can be included in the input at any time by the 
built-in function include. For example, 



include(filename) 



inserts the contents of filename in place of the include
command. The contents of the file are often a set of definitions.
The value of include (include's replacement text) is the
contents of the file. If needed, the contents can be captured in
definitions, etc.

A fatal error occurs if the file named in include cannot be 
accessed. To get some control over this situation, the alternate 
form sinclude can be used. The built-in sinclude (silent 
include) says nothing and continues if the file named cannot be 
accessed. 

The output of m4 can be diverted to temporary files during 
processing, and the collected material can be output upon 
command. The m4 maintains nine of these diversions, 
numbered 1 through 9. If the built-in macro 

divert(n) 



is used, all subsequent output is put onto the end of a
temporary file referred to as n. Diverting to this file is stopped
by the divert or divert(0) command, which resumes the
normal output process.

Diverted text is normally output all at once at the end of
processing with the diversions output in numerical order.
Diversions can also be brought back at any time, that is,
appended to the current diversion. Output diverted to a
stream other than 0 through 9 is discarded. The built-in
undivert brings back all diversions in numerical order. The
built-in undivert with arguments brings back the selected
diversions in the order given. The act of undiverting discards
the diverted text, as does diverting into a diversion whose
number is not between 0 and 9, inclusive.

The value of undivert is not the diverted text. Furthermore,
the diverted material is not rescanned for macros. The built-in
divnum returns the number of the currently active diversion.
The current output stream is zero during normal processing.
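
A small sketch of diverting and undiverting follows. It assumes
an m4 on the search path; the shell function name demo and the
sample text are illustrative only.

```shell
# Assumption: "m4" is installed.
demo() {
m4 <<'EOF'
divert(1)dnl
second
divert(0)dnl
first: active diversion is divnum
undivert(1)dnl
third
EOF
}
# The line diverted to stream 1 is held back until undivert(1)
# appends it to the normal output.
out=$(demo)
echo "$out"
```

Although the line "second" appears first in the input, it is
written between the other two lines.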



SYSTEM COMMAND 

Any program in the local operating system can be run by using 
the syscmd built-in. For example, 



syscmd(date) 



on the UNIX system runs the date command. Normally, 
syscmd would be used to create a file for a subsequent 
include. To facilitate making unique file names, the built-in 
maketemp is provided with specifications identical to the 
system function mktemp. The maketemp macro fills in a 
string of XXXXX in the argument with the process ID of the 
current process. 



CONDITIONALS 

Arbitrary conditional testing is performed via built-in ifelse. 
In the simplest form 



ifelse(a, b, c, d) 



compares the two strings a and b. If a and b are identical,
ifelse returns the string c. Otherwise, string d is returned.
Thus, a macro called compare can be defined as one which
compares two strings and returns "yes" if they are the same
and "no" if they are different, as follows:

define(compare, `ifelse($1, $2, yes, no)')

Note the quotes, which prevent evaluation of ifelse from
occurring too early. If the fourth argument is missing, it is
treated as empty.

The built-in ifelse can actually have any number of arguments 
and provides a limited form of multiway decision capability. In 
the input 

ifelse(a, b, c, d, e, f, g) 



if the string a matches the string b, the result is c. Otherwise,
if d is the same as e, the result is f. Otherwise, the result is g.
If the final argument is omitted, the result is null, so

ifelse(a, b, c)

is c if a matches b, and null otherwise.
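
Both the four-argument form and the multiway form can be
exercised with a sketch like the one below. It assumes an m4 on
the search path with the default quote characters; the macro
kind and the shell function demo are hypothetical names
introduced for the example.

```shell
# Assumption: "m4" is installed and uses the default quote characters.
demo() {
m4 <<'EOF'
define(`compare', `ifelse($1, $2, yes, no)')dnl
define(`kind', `ifelse($1, 0, zero, $1, 1, one, many)')dnl
compare(abc, abc) compare(abc, abd)
kind(0) kind(1) kind(7)
EOF
}
# kind tests its argument against 0, then against 1, and falls
# through to the final argument otherwise.
out=$(demo)
echo "$out"
```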

STRING MANIPULATION 

The built-in len returns the length of the string (number of 
characters) that makes up its argument. Thus: 

len(abcdef) 

is 6, and len((a,b)) is 5. 

The built-in substr can be used to produce substrings of
strings. The input substr(s, i, n) returns the substring of s
that starts at the ith position (origin zero) and is n characters
long. If n is omitted, the rest of the string is returned.
Inputting

substr(`now is the time',1)

returns the following string:

ow is the time

If i or n are out of range, various actions occur.

The built-in index(s1, s2) returns the index (position) in s1
where the string s2 occurs or -1 if it does not occur. As with
substr, the origin for strings is 0.

The built-in translit performs character transliteration and 
has the general form 

translit(s, f, t) 



which modifies s by replacing any character found in f by the
corresponding character of t. Using input

translit(s, aeiou, 12345)



replaces the vowels by the corresponding digits. If t is shorter
than f, characters that do not have an entry in t are deleted. As
a limiting case, if t is not present at all, characters from f are
deleted from s. So

translit(s, aeiou) 



would delete vowels from s. 
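
The string built-ins can be compared side by side with a sketch
such as the following. It assumes an m4 on the search path with
the default quote characters; the shell function name demo is
illustrative.

```shell
# Assumption: "m4" is installed and uses the default quote characters.
demo() {
m4 <<'EOF'
len(abcdef) len((a,b))
substr(`now is the time', 1)
substr(`now is the time', 4, 2)
index(`now is the time', `is') index(`now is the time', `xyz')
translit(`now is the time', aeiou, 12345)
translit(`now is the time', aeiou)
EOF
}
# Positions are counted from zero, and index reports -1 when the
# second string does not occur in the first.
out=$(demo)
echo "$out"
```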

There is also a built-in called dnl that deletes all characters
that follow it up to and including the next new line. The dnl 
macro is useful mainly for throwing away empty lines that 
otherwise tend to clutter up m4 output. Using input 

define(N, 100) 
define(M, 200) 
define(L, 300) 



results in a new line at the end of each line that is not part of
the definition. So the new line is copied into the output where
it may not be wanted. If the built-in dnl is added to each of
these lines, the newlines will disappear. Another method of
achieving the same results is to input

divert(-1)
define(...)

divert
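
One way to see the stray newlines is to count output lines. The
sketch below does so for the plain, dnl, and divert(-1) styles.
It assumes an m4 on the search path; count_lines is a
hypothetical shell helper, and divert(0)dnl is used at the end
of the third program so that the newline after the final divert
is removed as well.

```shell
# Assumption: "m4" is installed. count_lines is an illustrative helper
# that prints how many lines of output an m4 program produces.
count_lines() {
  printf '%s\n' "$1" | m4 | wc -l | tr -d ' '
}

plain=$(count_lines 'define(N, 100)
define(M, 200)
N M')

trimmed=$(count_lines 'define(N, 100)dnl
define(M, 200)dnl
N M')

diverted=$(count_lines 'divert(-1)
define(N, 100)
define(M, 200)
divert(0)dnl
N M')

echo "plain=$plain trimmed=$trimmed diverted=$diverted"
```

The plain form emits one empty line per definition before the
line of real output; the other two forms emit only the final
line.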



PRINTING 

The built-in errprint writes its arguments out on the standard 
error file. An example would be 



errprint(`fatal error')

The built-in dumpdef is a debugging aid that dumps the 
current names and definitions of items named as arguments. If 
no arguments are given, then all current names and definitions 
are printed. Do not forget to quote the names.



Chapter 16

THE "awk" PROGRAMMING 
LANGUAGE 



GENERAL 

The awk programming language is a file-processing language
designed to make many common information retrieval and text
manipulation tasks easy to state and perform. The awk language:

• Generates reports 

• Matches patterns 

• Validates data 

• Filters data for transmission. 



PROGRAM STRUCTURE 

The awk program is a sequence of statements of the form 



pattern {action} 
pattern {action} 



The awk program is run on a set of input files. The basic
operation of awk is to scan a set of input lines, in order, one at
a time. In each line, awk searches for the pattern described in
the awk program; if that pattern is found in the input
line, the corresponding action is performed. In this way, each
statement of the awk program is executed for a given input
line. When all the patterns are tested, the next input line is
fetched, and the awk program is once again executed from the
beginning.

In an awk statement, either the pattern or the action may be
omitted, but not both. If there is no action for a pattern, the
matching line is simply printed. If there is no pattern for an
action, then the action is performed for every input line. The
null awk program does nothing. Since patterns and actions
are both optional, actions are enclosed in braces to distinguish
them from patterns.



For example, this awk program

/x/ {print}

prints every input line that has an "x" in it.

An awk program has the following structure: 



- a <BEGIN> section 

- a <record> or main section 

- an <END> section. 



The <BEGIN> section is run before any input lines are read, 
and the <END> section is run after all the data files are 
processed. The <record> section is data driven. That is, it is the 
section that is run over and over for each separate line of input. 

Values are assigned to variables from the awk command line.
The <BEGIN> section is run before these assignments are
made.

The words "BEGIN" and "END" are actually patterns
recognized by awk. These are discussed further in the pattern
section of this guide.
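
The three sections can be seen together in a short sketch such
as the one below. It is an illustration, not part of the
original guide: the input lines, the counter n, and the shell
function demo are invented for the example, and an awk on the
search path is assumed.

```shell
# Assumption: "awk" is installed; the sample input is made up.
demo() {
  printf 'x marks the spot\nnothing here\nanother x\n' |
    awk 'BEGIN { print "start" }
         /x/   { print; n++ }
         END   { print n " matching lines" }'
}
# BEGIN runs once before any input, the /x/ statement runs for
# each matching line, and END runs once after the last line.
out=$(demo)
echo "$out"
```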



LEXICAL CONVENTIONS

All awk programs are made up of lexical units called tokens. 
In awk there are eight token types: 



1. numeric constants 

2. string constants 

3. keywords 

4. identifiers 

5. operators 

6. record and file tokens 

7. comments 

8. separators. 

Numeric Constants 

A numeric constant is either a decimal constant or a floating
constant. A decimal constant is a nonnull sequence of digits
containing at most one decimal point as in 12, 12., 1.2, and
.12. A floating constant is a decimal constant followed by e or
E followed by an optional + or - sign followed by a nonnull
sequence of digits as in 12e3, 1.2e3, 1.2e-3, and 1.2E+3.
The maximum size and precision of a numeric constant are
machine dependent.

String Constants

A string constant is a sequence of zero or more characters
surrounded by double quotes as in "", "a", "ab", and "12".
A double quote is put in a string by preceding it with \ as in
"He said, \"Sit!\"". A newline is put in a string by using \n in
its place. No other characters need to be escaped. Strings can
be (almost) any length.



Keywords 



Strings used as keywords are shown in Figure 16-1. 



Keywords

BEGIN       break       length
END         close       log
FILENAME    continue    next
FS          else        number
NF          exit        print
NR          exp         printf
OFS         for         split
ORS         getline     sprintf
OFMT        if          sqrt
RS          in          string
            index       substr
            int         while



Figure 16-1. Strings Used as Keywords

Identifiers

Identifiers in awk serve to denote variables and arrays. An 
identifier is a sequence of letters, digits, and underscores, 
beginning with a letter or an underscore. Uppercase and 
lowercase letters are different. 



Operators 

The awk language has assignment, arithmetic, relational, and
logical operators similar to those in the C programming language
and regular expression pattern matching operators similar to
those in the UNIX operating system programs egrep and lex.

Assignment operators are shown in Figure 16-2.



Assignment Operators

Symbol    Usage                 Description

=         assignment

+=        plus-equals           X += Y is similar to X = X+Y

-=        minus-equals          X -= Y is similar to X = X-Y

*=        times-equals          X *= Y is similar to X = X*Y

/=        divide-equals         X /= Y is similar to X = X/Y

%=        mod-equals            X %= Y is similar to X = X%Y

++        prefix and postfix    ++X and X++ are similar to X = X+1
          increments

--        prefix and postfix    --X and X-- are similar to X = X-1
          decrements

Figure 16-2. Symbols and Descriptions for Assignment
Operators

Arithmetic operators are shown in Figure 16-3.



Arithmetic Operators

Symbol    Description

+         unary and binary plus

-         unary and binary minus

*         multiplication

/         division

%         modulus

(...)     grouping



Figure 16-3. Symbols and Descriptions for Arithmetic
Operators

Relational operators are shown in Figure 16-4.



Relational Operators

Symbol    Description

<         less than

<=        less than or equal to

==        equal to

!=        not equal to

>=        greater than or equal to

>         greater than



Figure 16-4. Symbols and Descriptions for Relational
Operators

Logical operators are shown in Figure 16-5.



Logical Operators

Symbol    Description

&&        and

||        or

!         not



Figure 16-5. Symbols and Descriptions for Logical 
Operators 



Regular expression matching operators are shown in Figure 16-6.



Regular Expression Pattern Matching Operators

Symbol    Description

~         matches

!~        does not match



Figure 16-6. Symbols and Descriptions for Regular
Expression Pattern Matching Operators

Record and Field Tokens

The $0 is a special variable whose value is that of the current
input record. The $1, $2, ... are special variables whose values
are those of the first field, the second field, ..., respectively,
of the current input record. The keyword NF (Number of
Fields) is a special variable whose value is the number of fields
in the current input record. Thus $NF has, as its value, the
value of the last field of the current input record. Notice that
the first field of each record is numbered 1 and that the number
of fields can vary from record to record. None of these variables
is defined in the action associated with a BEGIN or END pattern,
where there is no current input record.

The keyword NR (Number of Records) is a variable whose 
value is the number of input records read so far. The first 
input record read is 1. 
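
A brief sketch of these variables follows. It assumes an awk on
the search path; the two input lines and the shell function
name demo are invented for the example.

```shell
# Assumption: "awk" is installed; the sample input is made up.
demo() {
  printf 'alpha beta gamma\none two\n' |
    awk '{ print NR, NF, $1, $NF }'
}
# For each record: its number, its field count, its first field,
# and its last field.
out=$(demo)
echo "$out"
```

The second record has only two fields, so $NF refers to its
second field.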



Record Separators 

The keyword RS (Record Separator) is a variable whose value
is the current record separator. The value of RS is initially set
to newline, indicating that adjacent input records are separated
by a newline. Keyword RS is changed to any character c by
including the assignment statement RS = "c" in an action.



Field Separator 

The keyword FS (Field Separator) is a variable indicating the
current field separator. Initially, the value of FS is a blank,
indicating that fields are separated by white space, i.e., any
nonnull sequence of blanks and tabs. Keyword FS is changed to
any single character c by including the assignment statement
FS = "c" in an action or by using the optional command line
argument -Fc. Two values of c have special meaning, space
and t. The assignment statement FS = " " makes white space
the field separator; and on the command line, -Ft makes tab the
field separator.

If the field separator is not a blank, then there is a field in the
record on each side of the separator. For instance, if the field
separator is 1, the record 1XXX1 has three fields. The first
and last are null. If the field separator is blank, then fields are
separated by white space, and none of the NF fields are null.
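
The difference between a blank and a nonblank field separator
can be checked with a sketch like the following. It assumes an
awk on the search path; the variable names and sample records
are illustrative. The field count and fields are joined with
colons so that null fields are visible.

```shell
# Assumption: "awk" is installed; the sample records are made up.

# Separator 1: the record 1XXX1 has a null field on each side.
with_sep=$(printf '1XXX1\n' |
  awk 'BEGIN { FS = "1" } { print NF ":" $1 ":" $2 ":" $3 }')

# Default separator: leading and trailing white space yields no
# null fields.
with_blank=$(printf '  a   b  \n' | awk '{ print NF ":" $1 ":" $2 }')

# The same kind of setting made from the command line with -Fc.
with_flag=$(printf 'a:b c:d\n' | awk -F: '{ print $2 }')

echo "$with_sep"
echo "$with_blank"
echo "$with_flag"
```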



Multiline Records 

The assignment RS = "" makes an empty line the record
separator and makes a nonnull sequence (consisting of blanks,
tabs, and possibly a newline) the field separator. With this
setting, none of the first NF fields of any record are null.



Output Record and Field Separators 

The value of OFS (Output Field Separator) is the output field
separator. It is put between fields by print. The value of ORS
(Output Record Separator) is put after each record by print.
Initially, ORS is set to a newline and OFS to a space. These
values may be changed to any string by assignments such as ORS
= "abc" and OFS = "xyz".
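
Multiline records and the output separators can be combined in
one sketch, shown below. It assumes an awk on the search path;
the input text, the chosen separators, and the shell function
name demo are illustrative only.

```shell
# Assumption: "awk" is installed; the sample input is made up.
demo() {
  printf 'a b\nc\n\nd\ne f\n' |
    awk 'BEGIN { RS = ""; OFS = "-"; ORS = ";\n" }
         { print NF, $1, $NF }'
}
# With RS empty, each block of lines up to an empty line is one
# record, and newlines inside the block separate fields.
out=$(demo)
echo "$out"
```

Each of the two records has three fields, and print joins its
arguments with OFS and ends each record with ORS.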



Comments 

A comment is introduced by a # and terminated by a newline. 
For example: 

# part of the line is a comment 

A comment can be appended to the end of any line of an awk 
program. 



Separators and Brackets 

Tokens in awk are usually separated by nonnull sequences of
blanks, tabs, and newlines, or by other punctuation symbols such
as commas and semicolons. Braces {...} surround actions,
slashes /.../ surround regular expression patterns, and double
quotes "..." surround strings.

PRIMARY EXPRESSIONS

In awk, patterns and actions are made up of expressions. The 
basic building blocks of expressions are the primary 
expressions: 

numeric constant
string constant
var 
function 

Each expression has both a numeric and a string value, one of 
which is usually preferred. The rules for determining the 
preferred value of an expression are explained below. 



Numeric Constants 

The format of a numeric constant was defined previously in
LEXICAL CONVENTIONS. Numeric values are stored as
floating point numbers. Both the numeric and string values of a
numeric constant are the decimal number represented by the
constant. The preferred value is the numeric value.

Values for numeric constants are in Figure 16-7.



Numeric Constants

Numeric     Numeric    String
Constant    Value      Value

0           0          0

1           1          1

.5          0.5        .5

.5e2        50         50



Figure 16-7. Numeric and String Values for Numeric Constants



String Constants 

The format of a string constant was defined previously in
LEXICAL CONVENTIONS. The numeric value of a string
constant is 0 unless the string is a numeric constant enclosed
in double quotes. In this case, the numeric value is the number
represented. The preferred value of a string constant is its
string value. The string value of a string constant is always
the string itself.

String values for string constants are in Figure 16-8.



String Constants

String      Numeric    String
Constant    Value      Value

""          0          empty string

"a"         0          a

"xyz"       0          xyz

"0"         0          0

"1"         1          1

".5"        0.5        .5

".5e2"      0.5        .5e2

Figure 16-8. String Values for String Constants 



Vars 

A var is one of the following:

identifier

identifier [ expression ]

$term

The numeric value of any uninitialized var is 0, and the string
value is the empty string.

An identifier by itself is a simple variable. A var of the form
identifier [expression] represents an element of an associative
array named by identifier. The string value of expression is
used as the index into the array. The preferred value of
identifier or identifier [expression] is determined by context.

The var $0 refers to the current input record. Its string and
numeric values are those of the current input record. If the
current input record represents a number, then the numeric
value of $0 is the number and the string value is the literal
string. The preferred value of $0 is string unless the current
input record is a number. The $0 cannot be changed by
assignment.

The vars $1, $2, ... refer to fields 1, 2, ... of the current input
record. The string and numeric values of $i for 1<=i<=NF are
those of the ith field of the current input record. As with $0, if
the ith field represents a number, then the numeric value of $i
is the number and the string value is the literal string. The
preferred value of $i is string unless the ith field is a number.
The $i can be changed by assignment. The $0 is then changed
accordingly.

In general, $term refers to the input record if term has the
numeric value 0 and to field i if the greatest integer in the
numeric value of term is i. If i<0 or if i>=100, then accessing
$i causes awk to produce an error diagnostic. If NF<i<=100,
then $i behaves like an uninitialized var. Accessing $i for i >
NF does not change the value of NF.
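
The three kinds of var can be seen in a sketch such as the one
below. It assumes an awk on the search path; the input data and
the names total, last, never_set, and demo are invented for the
example.

```shell
# Assumption: "awk" is installed; the sample input is made up.
demo() {
  printf 'apple 3\npear 5\napple 4\n' |
    awk '{ total[$1] += $2; i = 2; last = $i }
         END { print total["apple"], total["pear"], last,
                     never_set + 0, "[" never_set "]" }'
}
# total is an associative array indexed by the string in field 1;
# $i with i = 2 selects field 2; never_set is uninitialized, so
# it is 0 as a number and empty as a string.
out=$(demo)
echo "$out"
```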



Function 

The awk language has a number of built-in functions that
perform common arithmetic and string operations.

The arithmetic functions are in Figure 16-9.



Functions

exp     (expression)
int     (expression)
log     (expression)
sqrt    (expression)



Figure 16-9. Built-in Functions for Arithmetic and
String Operations



These functions (exp, int, log, and sqrt) compute the
exponential, integer part, natural logarithm, and square root,
respectively, of the numeric value of expression. The
(expression) may be omitted; then the function is applied to $0.
The preferred value of an arithmetic function is numeric.

String functions are shown in Figure 16-10.



String Functions

getline

index      (expression1, expression2)

length     (expression)

split      (expression, identifier, expression2)

split      (expression, identifier)

sprintf    (format, expression1, expression2...)

substr     (expression1, expression2)

substr     (expression1, expression2, expression3)

Figure 16-10. Expressions for String Functions 



The function getline causes the next input record to replace the
current record. It returns 1 if there is a next input record or
0 if there is no next input record. The value of NR is updated.

The function index(e1, e2) takes the string values of expressions
e1 and e2 and returns the first position where e2 occurs as a
substring in e1. If e2 does not occur in e1, index returns 0. For
example, index("abc", "bc") = 2 and index("abc", "ac") = 0.

The function length without an argument returns the number
of characters in the current input record. With an expression
argument, length(e) returns the number of characters in the
string value of e. For example, length("abc") = 3 and
length(17) = 2.

The function split(e, array, sep) splits the string value of
expression e into fields that are then stored in array[1],
array[2], ..., array[n] using the string value of sep as the field
separator. Split returns the number of fields found in e. The
function split(e, array) uses the current value of FS to indicate
the field separator. For example, after invoking n = split($0, a),
a[1], a[2], ..., a[n] is the same sequence of values as $1, $2, ...,
$NF.

The function sprintf(f, e1, e2, ...) produces the value of
expressions e1, e2, ... in the format specified by the string
value of the expression f. The format control conventions are
those of the printf statement in the C programming language
[KR].

The function substr(string, pos) returns the suffix of string
starting at position pos. The function substr(string, pos,
length) returns the substring of string that begins at position
pos and is length characters long. If pos + length is greater
than the length of string, then substr(string, pos, length) is
equivalent to substr(string, pos). For example, substr("abc",
2, 1) = "b", substr("abc", 2, 2) = "bc", and substr("abc",
2, 3) = "bc". Positions less than 1 are taken as 1. A negative
or zero length produces a null result.

The preferred value of sprintf and substr is string. The
preferred value of the remaining string functions is numeric.
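
The examples given for the string functions can be reproduced
with a sketch like the following. It assumes an awk on the
search path that runs a program consisting only of a BEGIN
statement without reading input; the colon-separated string,
the format string, and the shell function name demo are
illustrative.

```shell
# Assumption: "awk" is installed and needs no input for a
# BEGIN-only program.
demo() {
  awk 'BEGIN {
    print index("abc", "bc"), index("abc", "ac")
    print length("abc"), length(17)
    n = split("10:20:30", a, ":")
    print n, a[1], a[3]
    print substr("abc", 2), substr("abc", 2, 1), substr("abc", 2, 3)
    print sprintf("%5.2f|%-4s|%d", 3.14159, "ab", 42)
  }'
}
# Positions are counted from 1, and index returns 0 when the
# second string does not occur in the first.
out=$(demo)
echo "$out"
```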



TERMS 

Various arithmetic operators are applied to primary
expressions to produce larger syntactic units called terms. All
arithmetic is done in floating point. A term has one of the
following forms:

primary expression
term binop term
unop term
incremented var
(term)



Binary Terms

In a term of the form

term1 binop term2

binop can be one of the five binary arithmetic operators +, -, *
(multiplication), / (division), % (modulus). The binary operator
is applied to the numeric values of the operands term1 and term2,
and the result is the usual numeric value. This numeric value is
the preferred value, but it can be interpreted as a string value
(see Numeric Constants). The operators *, /, and % have
higher precedence than + and -. All operators are left
associative.



Unary Term 

In a term of the form 

unop term 

unop can be unary + or -. The unary operator is applied to the
numeric value of term, and the result is the usual numeric
value which is preferred. However, it can be interpreted as a
string value. Unary + and - have higher precedence than *, /,
and %.



Incremented Vars

An incremented var has one of the forms

++var
--var
var++
var--

The ++var has the value var + 1 and has the effect of var =
var + 1. Similarly, --var has the value var - 1 and has the
effect of var = var - 1. In contrast, var++ has the same value
as var and has the effect of var = var + 1. Similarly, var--
has the same value as var and has the effect of var = var - 1.
The preferred value of an incremented var is numeric.
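
The difference between the prefix and postfix forms can be seen in a short run. This is a sketch assuming a POSIX-compatible awk; the variable names x and out are illustrative only.

```shell
out=$(awk 'BEGIN {
    x = 5
    print ++x    # prefix: x becomes 6 and the value used is 6
    print x++    # postfix: the value used is still 6, then x becomes 7
    print x      # x is now 7
}')
echo "$out"
```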

Parenthesized Terms 

Parentheses are used to group terms in the usual manner. 



EXPRESSIONS 

An awk expression is one of the following: 

term 

term term ... 

var asgnop expression 

Concatenation of Terms 

In an expression of the form term1 term2 ..., the string values of
the terms are concatenated. The preferred value of the
resulting expression is a string value that can be interpreted as
a numeric value. Concatenation of terms has lower precedence
than binary + and -. For example, 1+2 3+4 has the string (and
numeric) value 37.
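
The precedence rule can be tried out as follows. This is a small sketch assuming a POSIX-compatible awk; s and out are illustrative names.

```shell
out=$(awk 'BEGIN {
    s = 1+2 3+4    # additions are done first, then 3 and 7 are concatenated
    print s        # used as a string
    print s + 1    # the same value used as a number
}')
echo "$out"
```

The first line printed is the concatenation 37; the second shows the same value taking part in arithmetic.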



Assignment Expressions

An assignment expression has the form

var asgnop expression

where asgnop is one of the six assignment operators:

=   +=   -=   *=   /=   %=

The preferred value of var is the same as that of expression.

In an expression of the form

var = expression

the numeric and string values of var become those of
expression. The form

var op= expression

is equivalent to

var = var op expression

where op is one of +, -, *, /, %. The asgnops are right
associative and have the lowest precedence of any operator.
Thus, a += b *= c-2 is equivalent to the sequence of
assignments

b = b * (c-2)
a = a + b
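
The right associativity of the assignment operators can be confirmed with concrete numbers. This is a sketch assuming a POSIX-compatible awk; the starting values 1, 2, and 5 are arbitrary and chosen only for illustration.

```shell
out=$(awk 'BEGIN {
    a = 1; b = 2; c = 5
    a += b *= c - 2     # first b = b * (c-2), then a = a + b
    print a, b
}')
echo "$out"
```

Here b becomes 2 * 3 = 6 and a then becomes 1 + 6 = 7.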



USING awk 

There are two ways in which to present your awk program of 
pattern-action statements to awk for processing: 



1. If the program is short (a line or two), it is often easiest to
make the program the first argument on the command line:

awk 'program' files

where "files" is an optional list of input files and
"program" is your awk program. Note that there are
single quotes around the program in order for the shell to
accept the entire string (program) as the first argument to
awk. For example, write to the shell

awk '/x/ {print}' files

to run the awk script /x/ {print} on the input file "files".
If no input files are specified, awk takes input from the
standard input stdin. You can also specify that input
comes from stdin by using "-" (the hyphen) as one of the
files. The command

awk 'program' files -

looks for input from "files" and from stdin and processes
first from "files" and then from stdin.

2. Alternatively, if your awk program is long, it is more
convenient to put the program in a separate file, awkprog,
and tell awk to fetch it from there. This is done by using
the "-f" option after the awk command as follows:

awk -f awkprog files

where "files" is an optional list of input files that may
include stdin as is indicated by a hyphen (-).

For example:

awk 'BEGIN {
print "hello, world"
exit
}'

prints

hello, world

on the standard output when given to the shell. Recall that the
word "BEGIN" is a special pattern indicating that the action
following in braces is run before any data is read. The words
"print" and "exit" are both discussed in later sections.

This awk program could also be run by putting

BEGIN {
print "hello, world"
exit
}

in a file named awkprog, and then giving the command

awk -f awkprog

to the shell. This would have the same effect as the first
procedure.
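
The second procedure can be sketched end to end as below. This assumes a POSIX shell with mktemp available and a POSIX-compatible awk; a temporary file stands in for awkprog so that nothing in the current directory is overwritten.

```shell
prog=$(mktemp)              # temporary file playing the role of awkprog
cat > "$prog" <<'EOF'
BEGIN {
    print "hello, world"
    exit
}
EOF
out=$(awk -f "$prog")       # same effect as giving the program on the command line
rm -f "$prog"
echo "$out"
```

Because the program consists only of a BEGIN action that exits, awk never waits for input.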



INPUT: RECORDS AND FIELDS 

Awk reads its input one record at a time unless you change
this. A record is a sequence of characters from the input
ending with a newline character or with an end of file. Thus, a
record is a line of input. The awk program reads in characters
until it encounters a newline or end of file. The string of
characters thus read is assigned to the variable $0. You can
change the character that indicates the end of a record by
assigning a new character to the special variable RS (the
record separator). Assignment of values to variables and to
special variables such as RS is discussed later.

Once awk has read in a record, it then splits the record into
"fields". A field is a string of characters separated by blanks
or tabs, unless you specify otherwise. You may change field
separators from blanks or tabs to whatever characters you
choose in the same way that record separators are changed.
That is, the special variable FS is assigned a different value.

As an example, let us suppose that the file " countries" contains 
the area in thousands of square miles, the population in 
millions, and the continent for the ten largest countries in the 
world. (Figures are from 1978; Russia is placed in Asia.)

Sample Input File "countries":

Russia      8650    262     Asia
Canada      3852    24      North America
China       3692    866     Asia
USA         3615    219     North America
Brazil      3286    116     South America
Australia   2968    14      Australia
India       1269    637     Asia
Argentina   1072    26      South America
Sudan       968     19      Africa
Algeria     920     18      Africa

The wide spaces are tabs in the original input and a single 
blank separates North and South from America. We use this 
data as the input for many of the awk programs in this guide 
since it is typical of the type of material that awk is best at 
processing (a mixture of words and numbers separated into 
fields or columns separated by blanks and tabs). 

Each of these lines has either four or five fields if blanks 
and/or tabs separate the fields. This is what awk assumes 
unless told otherwise. In the above example, the first record is 



Russia 8650 262 Asia 



When this record is read by awk, it is assigned to the variable 
$0. If you want to refer to this entire record, it is done through 
the variable, $0. 



For example, the following program:

{print $0}

prints the entire record. Fields within a record are assigned to
the variables $1, $2, $3, and so forth; that is, the first field of
the present record is referred to as $1 by the awk program.
The second field of the present record is referred to as $2 by
the awk program. The ith field of the present record is referred
to as $i by the awk program. Thus, in the above example of the
file countries, in the first record:



$1 is equal to the string " Russia" 
$2 is equal to the integer 8650 
$3 is equal to the integer 262 
$4 is equal to the string " Asia" 
$5 is equal to the null string 
... and so forth. 



To print the continent, followed by the name of the country, 
followed by its population, use the following awk script: 

{print $4, $1, $3} 
Note that awk does not require type declarations. 
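
The field variables can be inspected on a single record. This is a sketch assuming a POSIX-compatible awk; the record is supplied with printf rather than read from the countries file, and out is an illustrative name.

```shell
# One tab-separated record, as in the first line of the countries file.
out=$(printf 'Russia\t8650\t262\tAsia\n' | awk '{
    print NF            # number of fields in the record
    print $4, $1, $3    # continent, country, population
    print $2 + $3       # fields that look like numbers can be used in arithmetic
}')
echo "$out"
```

No declarations are needed: $2 and $3 are used as numbers simply because the expression calls for numbers.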



INPUT: FROM THE COMMAND LINE 

It is possible to assign values to variables from within an awk 
program. Because you do not declare types of variables, a 
variable is created simply by referring to it. An example of 
assigning a value to a variable is: 

x=5 

This statement in an awk program assigns the value 5 to the
variable x. It is also possible to assign values to variables from
the command line. This provides another way to supply input
values to awk programs.

For example

awk '{print x}' x=5 -

will print the value 5 on the standard output for each input
record. The minus sign at the end of this command is necessary
to indicate that input is coming from stdin instead of a file
called "x=5". Similarly, if the input comes from a file named
"file", the command is

awk '{print x}' x=5 file

It is not possible to assign values to variables used in the
BEGIN section in this way.

If it is necessary to change the record separator and the field
separator, it is useful to do so from the command line as in the
following example:

awk -f awk.program RS=":" file

Here, the record separator is changed to the character ":".
This causes your program in the file "awk.program" to run
with records separated by the colon instead of the newline
character and with input coming from the file, "file". It is
similarly useful to change the field separator from the
command line.

This operation is so common that there is yet another way to
change the field separator from the command line. There is a
separate option "-Fx" that is placed directly after the
command awk. This changes the field separator from blank or
tab to the character "x".

For example

awk -F: -f awk.program file

changes the field separator FS to the character ":". Note that
if the field separator is specifically set to a tab (that is, with
the -F option or by making a direct assignment to FS), then
blanks are not recognized by awk as separating fields. However,
even if the field separator is specifically set to a blank, tabs are
STILL recognized by awk as separating fields.



An exercise: 

Using the input file (" countries" described earlier) write an 
awk script that prints the name of a country followed by the 
continent that it is on. Do this in such a way that continents 
composed of two words (e. g., North America) are processed as 
only one field and not two. 



OUTPUT: PRINTING 

An action may have no pattern; in this case, the action is 
executed for all lines as in the simple printing program 

{print} 

This is one of the simplest actions performed by awk. It 
prints each line of the input to the output. More useful is to 
print one or more fields from each line. For instance, using the 
file " countries" , that was used earlier, 

awk '{ print $1, $3 }' countries 

prints the name of the country and the population:

Russia 262
Canada 24
China 866
USA 219
Brazil 116
Australia 14
India 637
Argentina 26
Sudan 19
Algeria 18



Note that the use of a semicolon at the end of statements in 
awk programs is optional. Awk accepts 

{print $1 } 

and 

{print $1; } 

equally and takes them to mean the same thing. If you want to 
put two awk statements on the same line of an awk script, 
the semicolon is necessary. For example, the following 
semicolon is necessary if you want the number 5 printed: 

{x=5; print x } 
Parentheses are also optional with the print statement. 

print $3, $2 
is the same as 

print ($3, $2 ) 

Items separated by a comma in a print statement are separated
by the current output field separator (normally a space, even
though the input is separated by tabs) when printed. The output
field separator OFS is another special variable that you can
change. These special variables are summarized in a later section.

An exercise: 

Using the input file, " countries" , print the continent followed 
by the country followed by the population for each input record. 
Then pipe the output to the UNIX operating system command 
" sort" so that all countries from a given continent are printed 
together. 

Print can also print strings directly from your program, as in
the awk script

{print "hello, world"}

from an earlier section.

An exercise: 

Print a header to the output of the previous exercise that says
"Population of Largest Countries" followed by headers to the
columns that follow describing what is in that column, for
example, Country or Population.



As we have already seen, awk makes available a number of
special variables with useful values, for example, FS and RS.
We now introduce two more special variables in the next example.
NR and NF are both integers that contain the number of the
present record and the number of fields in the present record,
respectively. Thus,

{print NR, NF, $0}

prints each record number and the number of fields in each
record followed by the record itself. Using this program on the
file "countries" yields:

1 4 Russia      8650    262     Asia
2 5 Canada      3852    24      North America
3 4 China       3692    866     Asia
4 5 USA         3615    219     North America
5 5 Brazil      3286    116     South America
6 4 Australia   2968    14      Australia
7 4 India       1269    637     Asia
8 5 Argentina   1072    26      South America
9 4 Sudan       968     19      Africa
10 4 Algeria    920     18      Africa


and the program

{print NR, $1}

prints

1 Russia
2 Canada
3 China
4 USA
5 Brazil
6 Australia
7 India
8 Argentina
9 Sudan
10 Algeria



This is an easy way to supply sequence numbers to a list.
Print, by itself, prints the input record. Use

print ""

to print the empty line.

Awk also provides the statement printf so that you can format
output as desired. Print uses the default format "%.6g" for
each number printed.

printf format, expr, expr, ...

formats the expressions in the list according to the
specification in the string, format, and prints them. The format
specification is exactly that of printf in the C library. For
example,

{ printf "%10s %6d %6d\n", $1, $2, $3 }

prints $1 as a string of 10 characters (right justified). The
second and third fields (6-digit numbers) make a neatly
columned table.



    Russia   8650    262
    Canada   3852     24
     China   3692    866
       USA   3615    219
    Brazil   3286    116
 Australia   2968     14
     India   1269    637
 Argentina   1072     26
     Sudan    968     19
   Algeria    920     18



With printf, no output separators or newlines are produced
automatically. You must add them as in this example. As in the C
library version of printf, the escape sequences "\n",
"\t", "\b" (backspace) and "\r" (carriage return) are valid
with the awk printf.
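
The effect of the field widths can be seen on two records. This is a sketch assuming a POSIX-compatible awk; the records are supplied with printf instead of the countries file, and out is an illustrative name.

```shell
# %10s right-justifies the name in 10 columns; each %6d uses 6 columns.
out=$(printf 'Russia\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\n' |
    awk '{ printf "%10s %6d %6d\n", $1, $2, $3 }')
echo "$out"
```

The explicit \n in the format is required; without it both records would run together on one output line.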

There is a third way that printing can occur on standard output:
when a pattern is specified but there is no action to go with it.
In this case, the entire record $0 is printed. For example, the
program

/x/ 

prints any record that contains the character " x" . 

There are two special variables that go with printing, OFS and 
ORS. These are by default set to blank and the newline 
character, respectively. The variable OFS is printed on the 
standard output when a comma occurs in a print statement 
such as 



{ x="hello"; y="world"
print x,y
}

which prints

hello world

However, without the comma in the print statement as

{ x="hello"; y="world"
print x y
}

you get

helloworld

To get a comma on the output, you can either insert it in the
print statement as in this case

{ x="hello"; y="world"
print x ", " y
}

or you can change OFS in a BEGIN section as in

BEGIN {OFS=", "}
{ x="hello"; y="world"
print x, y
}

Both of these last two scripts yield

hello, world


Note that the output field separator is not used when $0 is 
printed. 
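
Both points, that OFS replaces the comma and that it is not applied to $0, can be checked in one run. This is a sketch assuming a POSIX-compatible awk; the input line and the name out are illustrative.

```shell
out=$(echo 'hello world' | awk 'BEGIN { OFS = ", " }
{
    print $1, $2    # the comma is replaced by OFS
    print $0        # the record is printed unchanged
}')
echo "$out"
```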



OUTPUT: TO DIFFERENT FILES 

The UNIX operating system shell allows you to redirect 
standard output to a file. The awk program also lets you 
direct output to many different files from within your awk 
program. For example, with our input file " countries" , we 
want to print all the data from countries of Asia in a file called 
" ASIA" , all the data from countries in Africa in a file called 
" AFRICA" , and so forth. This is done with the following awk 
program:

{ if ($4 == "Asia") print > "ASIA"
if ($4 == "Europe") print > "EUROPE"
if ($4 == "North") print > "NORTH_AMERICA"
if ($4 == "South") print > "SOUTH_AMERICA"
if ($4 == "Australia") print > "AUSTRALIA"
if ($4 == "Africa") print > "AFRICA"
}

The flow of control statements (for example, "if") are
discussed later.



In general, you may direct output into a file after a print or a 
printf statement by using a statement of the form 

print > " FILE" 

where FILE is the name of the file receiving the data, and the 
print statement may have any legal arguments to it. 

Notice that the file names are quoted. Without quotes, the file
names are treated as uninitialized variables and all output then
goes to the same file.

If > is replaced by >>, output is appended to the file rather
than overwriting it.

Users should also note that there is an upper limit to the
number of files that are written in this way. At present it is
ten.



OUTPUT: TO PIPES

It is also possible to direct printing into a pipe instead of a file. 
For example, 



{
if ($2 == "XX") print | "mail mary"
}

where "mary" is someone's login name, any record with the
second field equal to "XX" is sent to the user, mary, as
mail. Awk waits until the entire program is run before it
executes the command that was piped to, in this case the
"mail" command.



For example:

{
print $1 | "sort"
}

takes the first field of each input record, sorts these fields, and
then prints them. The command in quotes is any UNIX
operating system command.



An exercise: 

Write an awk script that uses the input file to 

• List countries that were used previously 

• Print the name of the countries

• Print the population of each country

• Sort the data so that countries with the largest 
population appear first 

• Mail the resulting list to yourself. 

Another example of using a pipe for output is the following 
idiom which guarantees that its output always goes to your 
terminal: 



print ... | "cat -u > /dev/tty"



Only one output statement to a pipe is permitted in an awk 
program. In all output statements involving redirection of 
output, the files or pipes are identified by their names but they 
are created and opened only once in the entire run. 
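
Output to a pipe can be sketched with the sort command. This assumes a POSIX-compatible awk and a standard sort; the three input names are taken from the countries file and the variable out is illustrative.

```shell
# Every $1 is written into a single sort process, which prints its
# result when awk finishes and the pipe is closed.
out=$(printf 'Russia\nCanada\nChina\n' | awk '{ print $1 | "sort" }')
echo "$out"
```

The pipe is opened once, on the first record, and the same sort process receives all later records.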



COMMENTS 

Comments are placed in awk programs; they begin with the 
character # and end with the end of the line as in 



print x, y # this is a comment



PATTERNS

A pattern in front of an action acts as a selector that 
determines if the action is to be executed. A variety of 
expressions are used as patterns: 

• Regular expressions 

• Arithmetic relational expressions 

• String valued expressions 

• Combinations of these. 

BEGIN and END 

The special pattern, BEGIN, matches the beginning of the input 
before the first record is read. The pattern, END, matches the 
end of the input after the last line is processed. BEGIN and 
END thus provide a way to gain control before and after 
processing for initialization and wrapping up. 

An example: 

As you have seen, you can use BEGIN to put column headings 
on the output 



BEGIN {print "Country", "Area", "Population", "Continent"}
{print}

which produces

Country Area Population Continent
Russia      8650    262     Asia
Canada      3852    24      North America
China       3692    866     Asia
USA         3615    219     North America
Brazil      3286    116     South America
Australia   2968    14      Australia
India       1269    637     Asia
Argentina   1072    26      South America
Sudan       968     19      Africa
Algeria     920     18      Africa



Formatting is not very good here; printf would do a better job 
and is usually mandatory if you really care about appearance. 

Recall also, that the BEGIN section is a good place to change 
special variables such as FS or RS. 



Example: 

BEGIN { FS = "\t"
print "Countries", "Area", "Population", "Continent"
}
{print}
END {print "The number of records is", NR}



In this program, FS is set to a tab in the BEGIN section and as 
a result all records (in the file countries) have exactly four 
fields. 
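
The change in field count can be checked on a record whose continent has two words. This is a sketch assuming a POSIX-compatible awk that accepts "\t" as a tab in a string constant; the variable names are illustrative.

```shell
record='Canada\t3852\t24\tNorth America\n'    # printf turns \t into real tabs

# Default separator: blanks and tabs both split, so North and America are separate fields.
default_nf=$(printf "$record" | awk '{ print NF }')

# Tab as the separator: the blank inside the continent name no longer splits.
tab_nf=$(printf "$record" | awk 'BEGIN { FS = "\t" } { print NF }')

echo "$default_nf $tab_nf"
```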



Note that if BEGIN is present it must be the first pattern; END
must be the last if it is used.



Relational Expressions



An awk pattern can be any expression involving comparisons
between strings of characters or numbers. For example, if you
want to print only countries with more than 100 million
population, use



$3 >100 



This tiny awk program is a pattern without an action so it 
prints each line whose third field is greater than 100 as follows: 



Russia      8650    262     Asia
China       3692    866     Asia
USA         3615    219     North America
Brazil      3286    116     South America
India       1269    637     Asia



To print the names of the countries that are in Asia, type

$4 == "Asia" {print $1}

which produces



Russia 

China 

India 



The conditions tested are <, <=, ==, !=, >=, and >. In such
relational tests, if both operands are numeric, a numerical
comparison is made. Otherwise, the operands are compared as
strings. Thus,



$1 >= "S"



selects lines that begin with S, T, U, and so forth which in this 
case is 



USA 3615 219 North America 
Sudan 968 19 Africa 



In the absence of other information, fields are treated as 
strings, so the program 



$1 == $4 



compares the first and fourth fields as strings of characters 
and prints the single line 



Australia 2968 14 Australia 



If fields appear as numbers, the comparisons are done 
numerically. 
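
The difference between the two kinds of comparison can be sketched as follows. This assumes a POSIX-compatible awk, in which fields that look like numbers are compared numerically; appending the null string is used here to force a string comparison, and the names n, s, and out are illustrative.

```shell
out=$(echo '10 9' | awk '{
    n = ($1 > $2)          # both fields look numeric: 10 > 9 is true
    s = ($1 "" > $2 "")    # forced to strings: "10" sorts before "9"
    print n, s
}')
echo "$out"
```

A true comparison has the value 1 and a false one has the value 0.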



Regular Expressions



Awk provides more powerful capabilities for searching for 
strings of characters than were illustrated in the previous 
section. These are regular expressions. The simplest regular 
expression is a literal string of characters enclosed in slashes. 



/Asia/ 

This is a complete awk program that prints all lines which
contain any occurrence of the name "Asia". If a line contains
"Asia" as part of a larger word like "Asiatic", it is also
printed (but there are no such words in the countries file).

Awk regular expressions include the regular expression forms
found in the text editor ed and the pattern finder grep, in
which certain characters have special meanings.

For example, we could print all lines that begin with A with

/^A/

or all lines that begin with A, B, or C with

/^[ABC]/

or all lines that end with "ia" with

/ia$/

In general, the circumflex (^) indicates the beginning of a line.
The dollar sign ($) indicates the end of the line and characters
enclosed in brackets ([]) match any one of the characters
enclosed. In addition, awk allows parentheses for grouping, the
pipe (|) for alternatives, + for "one or more" occurrences, and
? for "zero or one" occurrences. For example,

/x|y/ {print}

prints all records that contain either an "x" or a "y".

/ax+b/ {print} 



prints all records that contain an " a" followed by one or more 
" x's" followed by a " b" . For example, axb, Paxxxxxxxb, 
QaxxbR. 



/ax?b/ {print} 



prints all records that contain an " a" followed by zero or one 
" x" followed by a " b" . For example: ab, axb, yaxbPPP, CabD. 

The two characters "." and "*" have the same meaning as they
have in ed: namely, "." can stand for any character and "*"
means zero or more occurrences of the character preceding it.
For example,



/a.b/ 



matches any record that contains an " a" followed by any 
character followed by a " b" . That is, the record must contain 
an " a" and a " b" separated by exactly one character. For 
example, /a.b/ matches axb, aPb and xxxxaXbxx, but NOT ab, 
axxb. 



/ab*c/ 



matches a record that contains an " a" followed by zero or more 
" b" 's followed by a " c" . For example, it matches 



ac 

abc 

pqrabbbbbbbbbbc901 

Just as in ed, it is possible to turn off the special meaning of
metacharacters such as "^" and "*" by preceding these
characters with a backslash. An example of this is the pattern

/\/.*\//

which matches any string of characters enclosed in slashes.

One can also specify that any field or variable matches a
regular expression (or does not match it) by using the operators
~ or !~. For example, with the input file countries as before, the
program

$1 ~ /ia$/ {print $1}

prints all countries whose name ends in "ia":



Russia 
Australia 
India 
Algeria 



which is indeed different from the lines that end in "ia".
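
The distinction between matching a field and matching the whole line can be sketched on three records. This assumes a POSIX-compatible awk; the records are copied from the countries file and the variable names are illustrative.

```shell
data='Russia\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\nAlgeria\t920\t18\tAfrica\n'

# Match against the first field only: country names ending in ia.
by_field=$(printf "$data" | awk '$1 ~ /ia$/ { print $1 }')

# Match against the whole record: lines ending in ia.
by_line=$(printf "$data" | awk '/ia$/ { print $1 }')

echo "$by_field"
echo "$by_line"
```

Algeria is selected by the field test but not by the line test, because its line ends in Africa.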



Combinations of Patterns 

A pattern can be made up of other patterns combined with the
operators || (OR), && (AND), ! (NOT), and parentheses. For
example,



$2 >= 3000 && $3 >=100 



selects lines where both area AND population are large. For
example,

Russia      8650    262     Asia
China       3692    866     Asia
USA         3615    219     North America
Brazil      3286    116     South America

while

$4 == "Asia" || $4 == "Africa"

selects lines with Asia or Africa as the fourth field. An
alternate way to write this last expression is with a regular
expression:

$4 ~ /^(Asia|Africa)$/

&& and || guarantee that their operands are evaluated from
left to right; evaluation stops as soon as truth or falsehood is
determined.



Pattern Ranges 

The "pattern" that selects an action may also consist of two
patterns separated by a comma as in

pattern1, pattern2 { ... }

In this case, the action is performed for each line between an
occurrence of pattern1 and the next occurrence of pattern2
(inclusive). As an example with no action



/Canada/,/Brazil/ 



prints all lines between the one containing "Canada" and the
line containing "Brazil". For example,

Canada      3852    24      North America
China       3692    866     Asia
USA         3615    219     North America
Brazil      3286    116     South America



while 



NR == 2, NR == 5 { ... } 



does the action for lines 2 through 5 of the input. Different
types of patterns can be mixed, as in

/Canada/, $4 == "Africa"

which prints all lines from the first line containing "Canada" up
to and including the next record whose fourth field is "Africa".
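
A range on record numbers is the simplest case to try. This is a sketch assuming a POSIX-compatible awk; the five input lines and the variable out are illustrative.

```shell
# With no action, every record in the range from record 2 to record 4 is printed.
out=$(printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'NR == 2, NR == 4')
echo "$out"
```

Both end points are included, which is the meaning of "inclusive" in the description of pattern ranges.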

Users should note that patterns in this form occur OUTSIDE of 
the action parts of the awk programs (outside of the braces 
that define awk actions). If you need to check patterns inside 
an awk action (inside the braces), use a flow of control 
statement such as an " if" statement or a " while" statement. 
Flow of control statements are discussed in the part "BUILT-IN
FUNCTIONS".



ACTIONS

An awk action is a sequence of action statements separated by 
newlines or semicolons. These action statements do a variety of 
bookkeeping and string manipulating tasks. 



Variables, Expressions, and Assignments 

Awk provides the ability to do arithmetic and to store the
results in variables for later use in the program.
Variables can also store strings of characters. You cannot do
arithmetic on character strings, but you can stick them
together and pull them apart as shown. As an example,
consider printing the population density for each country in the
file countries.

{print $1, (1000000 * $3)/($2 * 1000)}



(Recall that in this file the population is in millions and the 
area in thousands.) The result is population density in people 
per square mile. 

Russia 30.289 
Canada 6.23053 
China 234.561 
USA 60.5809 
Brazil 35.3013 
Australia 4.71698 
India 501.97 
Argentina 24.2537 
Sudan 19.6281 
Algeria 19.5652 

The formatting is bad; using printf instead gives the
program

{printf "%10s %6.1f\n", $1, (1000000 * $3)/($2 * 1000)}

and the output

    Russia   30.3
    Canada    6.2
     China  234.6
       USA   60.6
    Brazil   35.3
 Australia    4.7
     India  502.0
 Argentina   24.3
     Sudan   19.6
   Algeria   19.6



Arithmetic is done internally in floating point. The arithmetic
operators are +, -, *, / and % (mod or remainder).

To compute the total population and number of countries from 
Asia, we could write 



/Asia/ { pop = pop + $3; n = n + 1 }
END {print "total population of", n, "Asian countries is", pop}

which produces

total population of 3 Asian countries is 1765



Actually, no experienced programmer would write

{pop = pop + $3; n = n + 1 }

since both assignments can be written more clearly and concisely.
The better way is

{pop += $3; ++n }

Indeed, the operators ++, --, -=, /=, *=, +=, and %= are
available in awk as they are in C. Operator x += y has the
same effect as x = x + y but += is shorter and runs faster.
The same is true of the ++ operator; it adds one to the value of
a variable. The increment operators ++ and -- (as in C) can be
used as prefix or as postfix operators. These operators can also
be used in expressions.
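
The concise form can be run on a few records to confirm the total. This is a sketch assuming a POSIX-compatible awk; four records from the countries file are supplied with printf, and out is an illustrative name.

```shell
out=$(printf 'Russia\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\nChina\t3692\t866\tAsia\nIndia\t1269\t637\tAsia\n' |
    awk '/Asia/ { pop += $3; ++n }
         END { print "total population of", n, "Asian countries is", pop }')
echo "$out"
```

Neither pop nor n is set before use; both start out with the numeric value 0.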



Initialization of Variables 

In the previous example, we did not initialize pop nor n; yet, 
everything worked properly. This is because (by default) 
variables are initialized to the null string which has a 
numerical value of 0. This eliminates the need for most 
initialization of variables in BEGIN sections. We can use 
default initialization to advantage in this program which finds 
the country with the largest population. 



maxpop < $3 { 

maxpop = $3 
country = $1 

} 
END {print country, maxpop}

which produces



China 866 



Field Variables 

Fields in awk share essentially all of the properties of 
variables. They are used in arithmetic and string operations 
and may be assigned to and initialized to the null string. Thus, 
divide the second field by 1000 to convert the area to millions of 
square miles by 



{ $2 /= 1000; print } 
or process two fields into a third with 



BEGIN { FS = "\t" }

{ $4 = 1000 * $3 / $2; print } 



or assign strings to a field as in 

/USA/ { $1 = "United States"; print }

which replaces USA by United States and prints the affected
line

United States 3615 219 North America

Fields are accessed by expressions; thus, $NF is the last field
and $(NF-1) is the second to the last. Note that the
parentheses are needed since $NF-1 is 1 less than the value in
the last field.
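
The effect of the parentheses can be seen on a three-field record. This is a sketch assuming a POSIX-compatible awk; the input line and the variable out are illustrative.

```shell
# $NF is the last field, $(NF-1) the one before it, and $NF-1 is arithmetic.
out=$(echo 'a b 10' | awk '{ print $NF, $(NF-1), $NF-1 }')
echo "$out"
```

The third value is 9 because $NF-1 subtracts 1 from the last field rather than selecting the second field.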



String Concatenation 

Strings are concatenated by writing them one after the other as 
in the following example: 



{ x = "hello"
x = x ", world"
print x
}

prints the usual

hello, world

With input from the file "countries", the following program:

/^A/ { s = s " " $1 }
END { print s }

prints

Australia Argentina Algeria

Variables, string expressions, and numeric expressions may
appear in concatenations; the numeric expressions are treated
as strings in this case.



Special Variables 

Some variables in awk have special meanings. The complete
list is given here.

NR          Number of the current record.

NF          Number of fields in the current record.

FS          Input field separator, by default it is set
            to a blank or tab.

RS          Input record separator, by default it is set
            to the newline character.

$i          The ith input field of the current record.

$0          The entire current input record.

OFS         Output field separator, by default it is set
            to a blank.

ORS         Output record separator, by default it is
            set to the newline character.

OFMT        The format for printing numbers, with
            the print statement, by default is "%.6g".

FILENAME    The name of the input file currently being
            read. This is useful because awk
            commands are typically of the form

            awk -f program file1 file2 file3 ...



Type

Variables (and fields) take on numeric or string values 
according to context. For example, in 



pop += $3 
pop is presumably a number, while in 

country = $1 
country is a string. In 

maxpop < $3 



the type of maxpop depends on the data found in $3. It is 
determined when the program is run. 

In general, each variable and field is potentially a string or a
number or both at any time. When a variable is set by the
assignment



v = expr



its type is set to that of expr. (Assignment also includes +=,
++, -=, and so forth.) An arithmetic expression is of the type
"number"; a concatenation of strings is of type "string". If the
assignment is a simple copy as in

v1 = v2



then the type of v1 becomes that of v2.

In comparisons, if both operands are numeric, the comparison 
is made numerically. Otherwise, operands are coerced to strings 
if necessary and the comparison is made on strings. 

The type of any expression is coerced to numeric by subterfuges
such as



expr + 0

and to string by



expr ""



This last expression is a string concatenated with the null string.
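
The effect of the two coercions on a comparison can be seen in
the following sketch. The input line "10 9" and the messages are
assumptions made up for the example.

```shell
out=$(echo "10 9" | awk '{
    a = $1 ""    # force string: "10"
    b = $2 ""    # force string: "9"
    # String comparison goes character by character, so "10" < "9".
    if (a < b) print "as strings, 10 sorts before 9"
    # Adding 0 forces a numeric comparison.
    if ($1 + 0 > $2 + 0) print "as numbers, 10 is greater than 9"
}')
echo "$out"
```

Both messages are printed, because the same two fields compare
differently depending on the type they are coerced to.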



Arrays 

As well as ordinary variables, awk provides 1-dimensional 
arrays. Array elements are not declared; they spring into 
existence by being mentioned. Subscripts may have any non- 
null value including non-numeric strings. 

As an example of a conventional numeric subscript, the
statement

x[NR] = $0



assigns the current input line to the NRth element of the array
x. In fact, it is possible in principle (though perhaps slow) to
process the entire input in a random order with the following
awk program:



{ x[NR] = $0 } 
END { ... program ... } 



The first line of this program records each input line into the 
array x. In particular, the following program 



{ x[NR] = $1 }



(when run on the file countries) produces an array of elements 
with 



x[1] = "Russia"
x[2] = "Canada"
x[3] = "China"

... and so forth. 



Arrays are also indexed by non-numeric values that give awk a
capability rather like the associative memory of Snobol tables.
For example, we can write

/Asia/ { pop["Asia"] += $3 }

/Africa/ { pop["Africa"] += $3 }

END { print "Asia=" pop["Asia"], "Africa=" pop["Africa"] }



which produces 



Asia=1765 Africa=37 



Notice the concatenation. Also, any expression can be used as a 
subscript in an array reference. Thus, 



area[$1] = $2



uses the first field of a line (as a string) to index the array 
area. 



BUILT IN FUNCTIONS 

The function

length



is provided by awk to compute the length of a string of
characters. The following program prints each record preceded
by its length:

{ print length, $0 }



In this case (the variable) length means length($0), the length
of the present record. In general, length(x) will return the
length of x as a string.



Example: 

With input from the file countries, the following awk program 
will print the longest country name: 



length($1) > max { max = length($1); name = $1 }
END {print name} 



The function

split



split(s, array) assigns the fields of the string s to successive
elements of the array array.

For example,

split("Now is the time", w)

assigns the value "Now" to w[1], "is" to w[2], "the" to w[3],
and "time" to w[4]. All other elements of the array w[], if any,
are set to the null string. It is possible to have a character
other than a blank as the separator for the elements of w. For
this, use split with three arguments:

n = split(s, array, sep)



This splits the string s into array[1], ..., array[n]. The number
of elements found is returned as the value of split. If the sep
argument is present, its first character is used as the field
separator; otherwise, FS is used. This is useful if, in the middle
of an awk script, it is necessary to change the field separator
for one record.
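
A short sketch of the three-argument form follows. The
colon-separated string and the names part and n are assumptions
chosen for the example.

```shell
# Split a made-up colon-separated string without touching FS.
result=$(awk 'BEGIN {
    n = split("root:x:0:sys", part, ":")
    # n is the number of elements; part[1]..part[n] hold the pieces.
    print n, part[1], part[4]
}')
echo "$result"
```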

Also provided by awk are the
Math Functions 



sqrt, 
log, 
exp, 
int. 



They provide the square root function, the base e logarithm
function, exponential and integral part functions. This last
function returns the greatest integer less than or equal to its
argument. These functions are the same as those of the C
library (int corresponds to the libc floor function) and so they
have the same return on error as those in libc. (See UNIX
System V User's Manual.)

The substring function

substr



extracts portions of strings. For example, substr(s,m,n)
produces the substring of s that begins at position m and is at
most n characters long. If the third argument (n in this case) is
omitted, the substring goes to the end of s. For example, we
could abbreviate the country names in the file countries by

{ $1 = substr($1, 1, 3); print }



which produces 




Rus    8650    262    Asia
Can    3852    24     North America
Chi    3692    866    Asia
USA    3615    219    North America
Bra    3286    116    South America
Aus    2968    14     Australia
Ind    1269    637    Asia
Arg    1072    26     South America
Sud    968     19     Africa
Alg    920     18     Africa



If s is a number, substr uses its printed image; 
substr(123456789,3,4)=3456. 

The function

index

index(s1,s2) returns the leftmost position where the string s2
occurs in s1, or zero if s2 does not occur in s1.
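
The index and substr functions are often used together. The
following sketch assumes a made-up string and variable names; it
is an illustration rather than part of the countries example.

```shell
result=$(awk 'BEGIN {
    s = "North America"
    p = index(s, "Am")    # leftmost position of "Am", counting from 1
    # substr with two arguments runs to the end of the string;
    # index returns 0 when the second string does not occur.
    print p, substr(s, p), index(s, "xyz")
}')
echo "$result"
```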

The function

sprintf



formats expressions as the printf statement does but
assigns the resulting string to a variable instead of sending
the results to stdout. For example,



x = sprintf("%10s %6d", $1, $2)

sets x to the string produced by formatting the values of $1 and
$2. The x is then used in subsequent computations.
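
The following sketch shows the formatted string being kept in a
variable and then used in a later computation. The format string
and the values are assumptions chosen for the example.

```shell
result=$(awk 'BEGIN {
    # Left-justify the name in 10 columns, right-justify the number in 6.
    x = sprintf("%-10s|%6d|", "Canada", 3852)
    print x
    print length(x)    # the stored string can be used like any other
}')
echo "$result"
```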

The function 
getline 



immediately reads the next input record. The fields, NR, and $0
are all set, but control is left at exactly the same spot in the awk
program. Getline returns 0 for the end of file and 1 for a
normal record.
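
As a sketch of how the return value of getline can be used, the
following joins input lines in pairs. The five-line input and the
"(unpaired)" marker are assumptions made up for the example.

```shell
pairs=$(printf 'a\nb\nc\nd\ne\n' | awk '{
    first = $0
    if ((getline) > 0)    # 1: another record was read into $0
        print first, $0
    else                  # 0: end of file, so there is no partner line
        print first, "(unpaired)"
}')
echo "$pairs"
```

Control stays inside the same action after getline, so the saved
line and the newly read line can be printed together.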



FLOW OF CONTROL 

Awk provides the basic flow of control statements

• if-else

• while

• for

with statement grouping as in C language.

The if statement is used as follows:

if ( condition ) statement1 else statement2



The condition is evaluated; and if it is true, statement1 is
executed; otherwise, statement2 is executed. The else part is
optional. Several statements enclosed in braces ({,}) are treated
as a single statement. Rewriting the maximum population
computation from the pattern section with an if statement
results in



{ if (maxpop < $3) {
maxpop = $3
country = $1

}
}
END { print country, maxpop }



There is also a while statement in awk. 



while ( condition ) statement 



The condition is evaluated; if it is true, the statement is 
executed. The condition is evaluated again, and if true, the 
statement is executed. The cycle repeats as long as the 
condition is true. For example, the following prints all input
fields one per line:

{ i = 1

while (i <= NF) {
print $i

++i

}
}



Another example is the Euclidean algorithm for finding the 
greatest common divisor of $1 and $2: 



{ printf "the greatest common divisor of %s and %s is ", $1, $2

while ($1 != $2) {

if ($1 > $2) $1 = $1 - $2
else $2 = $2 - $1

}
printf "%s\n", $1

}



The for statement is like that of C. 



for ( expression1 ; condition ; expression2 ) statement



has the same effect as



expression1
while (condition) {

statement

expression2

}

so



{ for (i = 1; i <= NF; i++)

print $i
}



is another awk program that prints all input fields one per 
line. 



There is an alternate form of the for statement that is suited for
accessing the elements of an associative array such as those in awk.



for (i in array) statement 



executes statement with the variable i set in turn to each 
subscript of array. The subscripts are each accessed once but in 
random order. Chaos will ensue if the variable i is altered or if 
any new elements are created within the loop. For example, 
you could use the "for" statement to print the record number 
followed by the record of all input records after the main 
program is executed. 



{ x[NR] = $0 } 
END { for (i in x) { print i, x[i] } }



A more practical example is the following use of strings to
index arrays to add the populations of countries by continents:

BEGIN { FS = "\t" }

{ population[$4] += $3 }
END { for (i in population)

print i, population[i]

}



In this program, the body of the for loop is executed for i
equal to the string "Asia", then for i equal to the string
"North America", and so forth until all the possible values of i
are exhausted; that is, until all the strings of names of
continents are used. Note, however, that the order in which the
loops are executed is not specified. Depending on that order,
such a program might produce



South America 142
Africa 37
Asia 1765
Australia 14
North America 243



Note that the expression in the condition part of an if, while,
or for statement can include relational operators like <, <=, >,
>=, ==, and !=; it can include regular expressions that are used
with the "matching" operators ~ and !~; it can include the
logical operators ||, &&, and !; and it can also include
parentheses for grouping.

The break statement (when it occurs within a while or for 
loop) causes an immediate exit from the while or for loop. 

The continue statement (when it occurs within a while or for
loop) causes the next iteration of the loop to begin.

The next statement in an awk program causes awk to skip
immediately to the next record and begin scanning patterns
from the top of the program. (Note the difference between
getline and next. Getline does not skip to the top of the awk
program.)
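
A common use of next is to discard records that later rules
should never see. The sketch below assumes a made-up input in
which lines beginning with # are comments.

```shell
total=$(printf '# header\nRussia 262\n# note\nChina 866\n' | awk '
/^#/ { next }          # skip comment lines and start on the next record
     { total += $2 }   # reached only for the remaining lines
END  { print total }')
echo "$total"
```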

If an exit statement occurs in the BEGIN section of an awk 
program, the program stops executing and the END section is 
not executed (if there is one). 

An exit that occurs in the main body of the awk program 
causes execution of the main body of the awk program to stop. 
No more records are read, and the END section is executed. 

An exit in the END section causes execution to terminate at 
that point. 
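
The behavior of exit in the main body can be sketched as follows.
The numeric input and the message text are assumptions made up
for the example.

```shell
result=$(printf '1\n2\n3\n4\n' | awk '
$1 == 3 { exit }                     # stop reading; control passes to END
        { sum += $1 }
END     { print "sum so far:", sum }')
echo "$result"
```

Only the first two records are added, since no further records
are read once exit is executed, yet the END section still runs.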



REPORT GENERATION 

The flow of control statements in the last section are especially 
useful when awk is used as a report generator. Awk is useful 
for tabulating, summarizing, and formatting information. We 
have seen an example of awk tabulating in the last section 
with the tabulation of populations. Here is another example of
this. Suppose you have a file "prog.usage" that contains lines
of three fields: name, program, and usage:



Smith    draw     3
Brown    eqn      1
Jones    nroff    4
Smith    nroff    1
Jones    spell    5
Brown    spell    9
Smith    draw     6

The first line indicates that Smith used the draw program
three times. To produce the total usage of each program by
each user, with the names in alphabetical order, use the
following program, called list.a:

{ use[$1 " " $2] += $3 }
END { for (np in use)

print np " " use[np] | "sort +0 +2nr" }



This program produces the following output when used on the 
input file, prog.usage. 



Brown    eqn      1
Brown    spell    9
Jones    nroff    4
Jones    spell    5
Smith    draw     9
Smith    nroff    1



If you would like to format the previous output so that each 
name is printed only once, pipe the output of the previous awk 
program into the following program, called " format.a" : 



{ if ($1 != prev) { 
print $1":" 
prev = $1 

} 
print" " $2" " $3 

} 



The variable prev is used to ensure that each unique value of
the first field is printed only once.

The command



awk -f list.a prog.usage | awk -f format.a



gives the output




Brown:
    eqn 1
    spell 9
Jones:
    nroff 4
    spell 5
Smith:
    draw 9
    nroff 1



It is often useful to combine different awk scripts and other 
shell commands such as sort as was done in the last script. 



COOPERATION WITH THE SHELL 

Normally, an awk program is either contained in a file or 
enclosed within single quotes as in 



awk '{print $1}' 



Awk uses many of the same characters that the shell does, such
as $ and the double quote. Surrounding the program by ' ... '
ensures that the shell passes the awk program to awk intact.

Consider writing an awk program to print the nth field, where
n is a parameter determined when the program is run. That is,
we want a program called field such that



field n

runs the awk program

awk '{print $n}'

How does the value of n get into the awk program?



There are several ways to do this. One is to define field as 
follows: 



awk '{print $'$1'}' 



Spaces are critical here: as written there is only one argument, 
even though there are two sets of quotes. The $1 is outside the 
quotes, visible to the shell, and therefore substituted properly 
when field is invoked. 

Another way to do this job relies on the fact that the shell
substitutes for $ parameters within double quotes.



awk "{print \$$1}"

Here the trick is to protect the first $ with a \; the $1 is again
replaced by the number when field is invoked.



This kind of trickery is extended in remarkable ways, but it is 
hard to understand quickly. 
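
Both quoting techniques can be tried out as in the sketch below.
Writing field as a shell function rather than a separate command
file, and the name field2 for the second variant, are assumptions
made for the example.

```shell
# Variant 1: close the single quotes around the shell parameter, so
# the shell substitutes $1 and awk sees, for example, {print $2}.
field() {
    awk '{print $'$1'}'
}

# Variant 2: double quotes, with the awk $ protected by a backslash.
field2() {
    awk "{print \$$1}"
}

first_way=$(echo "Smith draw 3" | field 2)
second_way=$(echo "Smith draw 3" | field2 3)
echo "$first_way $second_way"
```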



MISCELLANEOUS HINTS 

You can simulate the effect of multidimensional arrays by 
creating your own subscripts. For example, 



for (i = 1; i <= 10; i++)

for (j = 1; j <= 10; j++)
mult[i "," j] = ...



creates an array whose subscripts have the form i,j; that is, 1,1; 
1,2; and so forth, and thus simulates a 2-dimensional array. 
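
A complete version of this technique might look like the
following sketch. Filling the array with the product i * j and
the 3-by-3 size are assumptions chosen for the example.

```shell
result=$(awk 'BEGIN {
    for (i = 1; i <= 3; i++)
        for (j = 1; j <= 3; j++)
            mult[i "," j] = i * j    # the subscript is the string "i,j"
    # The same element can be reached with a literal subscript string
    # or with one built by concatenation.
    print mult["2,3"], mult[3 "," 3]
}')
echo "$result"
```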
Chapter 17
THE LINK EDITOR 

GENERAL 

The link editor [ld(1)*] is a UNIX system support tool used on
the VAX† processor and UNIX PC. The ld creates executable
object files by combining object files, performing relocation,
and resolving external references. The ld also processes
symbolic debugging information. The inputs to ld are
relocatable object files produced either by the compiler [cc(1)],
the assembler [as(1)], or by a previous ld run. The ld combines
these object files to form either a relocatable or an absolute
(i.e., executable) object file.

The ld also supports a command language that allows users to
control the ld process with great flexibility and precision. The
UNIX system ld shares most of its source with other lds in use
on other processors and operating systems. Therefore, the
UNIX system ld provides many powerful features that may or
may not be useful on a UNIX system.

Although the link edit process is controlled in detail through
use of the ld command language described later, most users do
not require this degree of flexibility, and the manual page is
sufficient instruction in the use of ld.

* Part 1 of the UNIX system User Manual

† Trademark of Digital Equipment Corporation

The command language (described later) supports the ability to

• Specify the memory configuration of the machine

• Combine object file sections in particular fashions 

• Cause the files to be bound to specific addresses or within 
specific portions of memory 

• Define or redefine global symbols at link edit time. 

There are several concepts and definitions with which you 
should familiarize yourself before proceeding further. 

Memory Configuration 

The virtual memory of the target machine is, for purposes of 
allocation, partitioned into configured and unconfigured 
memory. The default condition is to treat all memory as 
configured. It is common with microprocessor applications, 
however, to have different types of memory at different 
addresses. For example, an application might have 3K of
PROM (Programmable Read-Only Memory) beginning at
address 0, and 8K of RAM (Random-Access Memory) starting at
20K. Addresses in the range 3K to 20K-1 are then not
configured. Unconfigured memory is treated as "reserved" or
"unusable" by the ld. Nothing can ever be linked into
unconfigured memory. Thus, specifying a certain memory
range to be unconfigured is one way of marking the addresses
(in that range) "illegal" or "nonexistent" with respect to the
linking process. Memory configurations other than the default
must be explicitly specified by you (the user).

Unless otherwise specified, all discussions in this document of
memory, addresses, etc. are with respect to the configured
sections of the address space.

Section

A section of an object file is the smallest unit of relocation and 
must be a contiguous block of memory. A section is identified 
by a starting address and a size. Information describing all 
the sections in a file is stored in "section headers" at the start 
of the file. Sections from input files are combined to form 
output sections that contain executable text, data, or a mixture 
of both. Although there may be "holes" or gaps between input 
sections and between output sections, storage is allocated 
contiguously within each output section and may not overlap a 
hole in memory. 



Addresses 

The physical address of a section or symbol is the relative 
offset from address zero of the address space. The physical 
address of an object is not necessarily the location at which it is 
placed when the process is executed. For example, on a system 
with paging, the address is with respect to address zero of the 
virtual space, and the system performs another address 
translation. 



Binding 

It is often necessary to have a section begin at a specific, 
predefined address in the address space. The process of 
specifying this starting address is called "binding", and the 
section in question is said to be "bound to" or "bound at" the 
required address. While binding is most commonly relevant to 
output sections, it is also possible to bind global symbols with 
an assignment statement in the ld command language.

Object File

Object files are produced both by the assembler (typically as a
result of calling the compiler) and by the ld. The ld accepts
relocatable object files as input and produces an output object
file that may or may not be relocatable. Under certain special
circumstances, the input object files given to the ld can also be
absolute files.



Files produced from the compiler/assembler always contain 
three sections, called .text, .data, and .bss. The .text section 
contains the instruction text (for example, executable 
instructions), .data contains initialized data variables, and .bss 
contains uninitialized data variables. For example, if a C 
program contained the global (i.e., not inside a function) 
declarations 

int i = 100; 
char abc[200]; 

and the assignment 

abc[i] = 0;

then compiled code from the C assignment is stored in .text. 
The variable i is located in .data, and abc is located in .bss. 
There is an exception to the rule however; both initialized and 
uninitialized statics are allocated into the .data section. The 
value of an uninitialized static in a .data section is zero. 



USING THE LINK EDITOR 

The ld is called by the command

ld [options] filename1 filename2 ...

Files passed to the ld must be object files, archive libraries
containing object files, or text source files containing ld
directives. The ld uses the "magic number" (in the first two
bytes of the file) to determine which type of file is encountered.
If the ld does not recognize the magic number, it assumes the
file is a text file containing ld directives and attempts to parse
it.

Input object files and archive libraries of object files are linked 
together to form an output object file. If there are no 
unresolved references, this file is executable on the target 
machine. An input file containing directives is referred to as 
an ifile in this document. Object files have the form "name.o" 
throughout the examples in this chapter. The names of actual 
input object files need not follow this convention. 

If you merely want to link the object files file1.o and file2.o, the
following command is sufficient:

ld file1.o file2.o

No directives to the ld are needed. If no errors are encountered
during the link edit, the output is left on the default file a.out.
The sections of the input files are combined in order. That is,
if file1.o and file2.o each contain the standard sections .text,
.data, and .bss, the output object file also contains these three
sections. The output .text section is a concatenation of .text
from file1.o and .text from file2.o. The .data and .bss sections
are formed similarly. The output .text section is then bound at
an address appropriate for the target machine (0x80000 on the
UNIX PC). The output .data and .bss sections are link edited
together into contiguous addresses (the particular address
depending on the particular processor).

Instead of entering the names of files to be link edited (as well
as ld options) on the ld command line, this information can be
placed into an ifile, and just the ifile passed to ld. For example,
if you are going to frequently link the object files file1.o, file2.o,
and file3.o with the same options f1 and f2, then you must enter
the command



ld -f1 -f2 file1.o file2.o file3.o

each time it is necessary to invoke ld. Alternatively, an ifile
containing the statements

-f1

-f2

file1.o

file2.o

file3.o



could be created, and then the following UNIX system
command would serve:



ld ifilename

Note that it is perfectly permissible to specify some of the
object files to be link edited in the ifile and others on the
command line, as well as some options in the ifile and others
on the command line. Input object files are link edited in the
order they are encountered, whether this occurs on the
command line or in an ifile. As an example, if a command line
were

ld file1.o ifile file2.o

and the ifile contained

file3.o
file4.o

then the order of link editing would be: file1.o, file3.o, file4.o,
and file2.o. Note from this example that an ifile is read and
processed immediately upon being encountered in the command
line.

Options may be interspersed with file names both on the
command line and in an ifile. The ordering of options is not
significant, except for the "l" and "L" options for specifying
libraries. The "l" option is a shorthand notation for specifying
an archive library, and an archive library is just a collection of
object files. Thus, as is the case with any object file, libraries
are searched as they are encountered. The "L" option specifies an
alternative directory for searching for libraries. Therefore, to
be effective, a "-L" option must appear before any "-l" options.

All options for ld must be preceded by a hyphen (-) whether in
the ifile or on the ld command line. Options that have an
argument (except for the "-l" and "-L" options) are separated
from the argument by white space (blanks or tabs). The
following options (in alphabetical order) are supported, though
not all options are available on each processor.

-e epsym    Defines the primary entry point of the output file to
            be the symbol given by the argument epsym. See
            "Changing the Entry Point" in "NOTES AND
            SPECIAL CONSIDERATIONS" for a discussion of
            how the option is used.

-f fill     Sets the default fill value. This value is used to fill
            "holes" formed within output sections. Also, it is
            used to initialize input .bss sections when they are
            combined with other non-.bss input sections. The
            argument fill is a 2-byte constant. If the "-f"
            option is not used, the default fill value is zero.

-lx         Specifies a UNIX system archive library file as ld
            input. The argument is a character string (less than
            10 characters) immediately following the "-l"
            without any intervening white space. As an
            example, -lc refers to libc.a, -lC to libC.a, etc. The
            given archive library must contain valid object files
            as its members.

-m          Produces a map or listing of the input/output
            sections (including "holes") on the standard output.

-o outfile  Names the output object file. The argument outfile
            is the name of the UNIX system file to be used as
            the output file. The default output object file name
            is "a.out". The outfile can be a full or partial UNIX
            system pathname.

-r          Retains relocation entries in the output object file.
            Relocation entries must be saved if the output file is
            to be used as an input file in a subsequent ld call. If
            the -r option is used, unresolved references do not
            prevent the creation of an output object file.

-s          Strips line number entries and symbol table
            information from the output object file. Relocation
            entries ("-r" option) are meaningless without the
            symbol table, hence use of "-s" precludes the use of
            "-r". All symbols are stripped, including global and
            undefined symbols.

-u symname  Introduces an unresolved external symbol into the
            output file's symbol table. The argument symname is
            the name of the symbol. This is useful for linking
            entirely from a library, since initially the symbol
            table is empty and an unresolved reference is needed
            to force the linking of an initial routine from the
            library.

-x          Does not preserve any local (nonglobal) symbols in
            the output symbol table; enter external and static
            symbols only. This option saves some space in the
            output file.

-L dir      Changes the algorithm for searching for libraries to
            look in dir before looking in the default location.
            This option is for ld libraries as the -I option is for
            compiler #include files. The "-L" option is useful for
            finding libraries that are not in the standard library
            directory. To be useful, this option must appear
            before the "-l" option.

-N          Places the data section immediately following the
            text section in memory and stores the magic number
            0407 in the UNIX system header. This prevents the
            text from being shared (the default).

-V          Prints on the standard error output a "version id"
            identifying the ld being run.

-VS num     Takes num as a decimal version number identifying
            the a.out file that is produced. The version stamp is
            stored in the UNIX system header.

-n          Separates the text from the data and .bss; the text
            is shared and not writable.

LINK EDITOR COMMAND LANGUAGE



Expressions 

Expressions may contain global symbols, constants, and most of
the basic C language operators. (See Figure 17-2, "SYNTAX
DIAGRAM FOR INPUT DIRECTIVES".) Constants are as in
C with a number recognized as decimal unless preceded with
"0" for octal or "0x" for hexadecimal. All numbers are treated
as long ints. Symbol names may contain uppercase or
lowercase letters, digits, and the underscore ('_'). Symbols
within an expression have the value of the address of the
symbol only. The ld does not do symbol table lookup to find the
contents of a symbol, the dimensionality of an array, structure
elements declared in a C program, etc.

The ld uses a lex-generated input scanner to identify symbols,
numbers, operators, etc. The current scanner design makes the
following names reserved and unavailable as symbol names or
section names:



ALIGN     DSECT     MEMORY    PHY       SECTIONS
ASSIGN    GROUP     NOLOAD    RANGE     SPARE
BLOCK     LENGTH    ORIGIN    REGION    TV

align     group     length    origin    spare
assign    l         o         phy
block     len       org       range



The operators that are supported, in order of precedence from
high to low, are shown in Figure 17-1:

symbol

!  ~  - (UNARY Minus)

*  /  %

+  - (BINARY Minus)

>>  <<

==  !=  >  <  <=  >=

&

|

&&

||

=  +=  -=  *=  /=

Figure 17-1. Symbols and Functions of Operators



The above operators have the same meaning as in the C 
language. Operators on the same line have the same 
precedence. 



Assignment Statements 

External symbols may be defined and assigned addresses via 
the assignment statement. The syntax of the assignment 
statement is 

symbol = expression; 



or 



symbol op= expression;

where op is one of the operators +, -, *, or /.

Assignment statements must be terminated by a semicolon. 

All assignment statements (with the exception of the one case 
described in the following paragraph) are evaluated after 
allocation has been performed. This occurs after all input-file- 
defined symbols are appropriately relocated but before the 
actual relocation of the text and data itself. Therefore, if an 
assignment statement expression contains any symbol name, 
the address used for that symbol in the evaluation of the 
expression reflects the symbol address in the output object file. 
References within text and data (to symbols given a value 
through an assignment statement) access this latest assigned 
value. Assignment statements are processed in the same order 
in which they are input to ld.

Assignment statements are normally placed outside the scope
of section-definition directives (see "Section Definition
Directives" under "LINK EDITOR COMMAND LANGUAGE").
However, there exists a special symbol, called ".", that can
occur only within a section-definition directive. This symbol
refers to the current address of the ld's location counter.
Thus, assignment expressions involving "." are evaluated during
the allocation phase of ld. Assigning a value to the "." symbol
within a section-definition directive increments/resets ld's
location counter and can create "holes" within the section, as
described in "Section Definition Directives". Assigning the
value of the "." symbol to a conventional symbol permits the
final allocated address (of a particular point within the link
edit run) to be saved.

Align is provided as a shorthand notation to allow alignment of
a symbol to an n-byte boundary within an output section, where
n is a power of 2. For example, the expression

align(n)

is equivalent to

(. + n - 1) &~(n - 1)
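
The arithmetic behind this expression can be tried outside the
link editor. The sketch below uses shell arithmetic, not ld
itself; the function name align and the sample addresses are
assumptions made for the example, with the first argument
standing in for the location counter ".".

```shell
# align ADDR N: round ADDR up to the next multiple of N,
# where N is a power of 2.
align() {
    echo $(( ($1 + $2 - 1) & ~($2 - 1) ))
}

a=$(align 13 8)         # 13 is rounded up to 16
b=$(align 16 8)         # an address already on the boundary is unchanged
c=$(align 4097 4096)    # one past a 4K boundary moves to the next one
echo "$a $b $c"
```

Adding n - 1 pushes any unaligned address past the next boundary,
and the mask then clears the low-order bits.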

Link editor expressions may have either an absolute or a 
relocatable value. When the ld creates a symbol through an
assignment statement, the symbol's value takes on that type of 
expression. That type depends on the following rules: 

• An expression with a single relocatable symbol (and zero or 
more constants or absolute symbols) is relocatable. The 
value is in relation to the section of the referenced symbol. 

• All other expressions have absolute values. 

Specifying a Memory Configuration 

MEMORY directives are used to specify 

a. The total size of the virtual space of the target 
machine. 

b. The configured and unconfigured areas of the 
virtual space. 

If no directives are supplied, the ld assumes that all memory is
configured. The size of the default memory is dependent upon 
the target machine. 

By means of MEMORY directives, an arbitrary name of up to 
eight characters is assigned to a virtual address range. Output 
sections can then be forced to be bound to virtual addresses 
within specifically named memory areas. Memory names may 
contain uppercase or lowercase letters, digits, and the special 
characters '$', '.', or '_'. Names of memory ranges are used by
ld only and are not carried in the output file symbol table or
headers.

When MEMORY directives are used, all virtual memory not
described in a MEMORY directive is considered to be
unconfigured. Unconfigured memory is not used in ld's
allocation process, and hence nothing can be link edited, bound,
or assigned to any address within unconfigured memory.

As an option on the MEMORY directive, attributes may be 
associated with a named memory area. This restricts the 
memory areas (with specific attributes) to which an output 
section can be bound. The attributes assigned to output 
sections in this manner are recorded in the appropriate section 
headers in the output file to allow for possible error checking in 
the future. For example, putting a text section into writable 
memory is one potential error condition. Currently, error 
checking of this type is not implemented. 

The attributes currently accepted are 

a. R : readable memory. 

b. W : writable memory. 

c. X : executable, i.e., instructions may reside in this 
memory. 

d. I : initializable. Stack areas, for example, are
typically not initialized.

Other attributes may be added in the future if necessary. If no 
attributes are specified on a MEMORY directive or if no 
MEMORY directives are supplied, memory areas assume the 
attributes of W, R, I, and X. 

The syntax of the MEMORY directive is

MEMORY
{
    name1 (attr) : origin = n1, length = n2
    name2 (attr) : origin = n3, length = n4
    etc.
}



The keyword "origin" (or "org" or "o") must precede the origin
of a memory range, and "length" (or "len" or "l") must precede
the length as shown in the above prototype. The origin operand
refers to the virtual address of the memory range. Origin and
length are entered as long integer constants in decimal,
octal, or hexadecimal (standard C syntax). Origin and length
specifications, as well as individual MEMORY directives, may
be separated by white space or a comma.

By specifying MEMORY directives, ld can be told that
memory is configured in some manner other than the default.
For example, if it is necessary to prevent anything from being
linked to the first 0x10000 words of memory, a MEMORY
directive can accomplish this.

MEMORY 

{ 

valid : org = 0x10000, len = 0xFE0000

} 



Section Definition Directives 

The purpose of the SECTIONS directive is to describe how
input sections are to be combined, to direct where to place
output sections (both in relation to each other and to the entire
virtual memory space), and to permit the renaming of output
sections.

In the default case where no SECTIONS directives are given,
all input sections of the same name appear in an output section
of that name. For example, if a number of object files from the
compiler are linked, each containing the three sections .text,
.data, and .bss, the output object file also contains three
sections, .text, .data, and .bss. If two object files are linked (one
that contains sections s1 and s2 and the other containing
sections s3 and s4), the output object file contains the four
sections s1, s2, s3, and s4. The order of these sections
depends on the order in which the link editor sees the input
files.

The basic syntax of the SECTIONS directive is 

SECTIONS
{
    secname1 :
    {
        file_specifications,
        assignment_statements*
    }
    secname2 :
    {
        file_specifications,
        assignment_statements*
    }
    etc.
}

* These may be intermixed.

The various types of section definition directives are discussed
in the remainder of this section.

File Specifications

Within a section definition, the files and sections of files to be
included in the output section are listed in the order in which
they are to appear in the output section. Sections from an
input file are specified by

filename ( secname )

or

filename ( secnam1 secnam2 . . . )

Sections of an input file are separated either by white space or
commas, as are the file specifications themselves.

If a file name appears with no sections listed, then all sections 
from the file are linked into the current output section. For 
example, 

SECTIONS
{
    outsec1:
    {
        file1.o (sec1)
        file2.o
        file3.o (sec1, sec2)
    }
}

The order in which the input sections appear in the output
section "outsec1" is given by

a. Section sec1 from file file1.o

b. All sections from file2.o, in the order they appear
in the file

c. Section sec1 from file file3.o, and then section sec2
from file file3.o.

If there are any additional input files that contain input
sections also named "outsec1", these sections are linked
following the last section named in the definition of "outsec1".
If there are any other input sections in file1.o or file3.o, they
are placed in output sections with the same names as the
input sections unless they are included in other file
specifications.



Load a Section at a Specified Address 

Bonding of an output section to a specific virtual address is
accomplished by an ld option as shown on the following
SECTIONS directive example:



SECTIONS 

{ 

outsec addr: 

{ 

} 
etc. 

} 

The "addr" is the bonding address expressed as a C constant.
If "outsec" does not fit at "addr" (perhaps because of holes in
the memory configuration or because "outsec" is too large to fit
without overlapping some other output section), ld issues an
appropriate error message.

So long as output sections do not overlap and there is enough
space, they can be bound anywhere in configured memory. The
SECTIONS directives defining output sections need not be given
to ld in any particular order.

ld does not ensure that each section's size consists of an
even number of bytes or that each section starts on an even
byte boundary. The assembler ensures that the size (in bytes)
of a section is evenly divisible by 4. The ld directives can be
used to force a section to start on an odd byte boundary,
although this is not recommended. If a section starts on an odd
byte boundary, the section's contents are either accessed
incorrectly or not executed properly. When a user specifies
an odd byte boundary, ld issues a warning message.



Aligning an Output Section 

It is possible to request that an output section be bound to a 
virtual address that falls on an n-byte boundary, where n is a 
power of 2. The ALIGN option of the SECTIONS directive 
performs this function, so that the option 



ALIGN(n)

is equivalent to specifying a bonding address of

(. + n - 1) & ~(n - 1)

For example

SECTIONS

{ 

outsec ALIGN(0x20000) : 

{ 

} 
etc. 

} 

The output section "outsec" is not bound to any given address 
but is linked to some virtual address that is a multiple of 
0x20000 (e.g., at address 0x0, 0x20000, 0x40000, 0x60000, etc.). 



Grouping Sections Together 

The default allocation algorithm for ld

a. Links all input .text sections together into one 
output section. This output section is called .text 
and is bound to an address of 0x0. 

b. Links all input .data sections together into one 
output section. This output section is called .data 
and is bound to an address aligned to a machine 
dependent constant. 

c. Links all input .bss sections together into one 
output section. This output section is called .bss 
and is allocated so as to immediately follow the 
output section .data. Note that the output section 
.bss is not given any particular address alignment. 

Specifying any SECTIONS directives results in this default 
allocation not being performed. 

The default allocation of ld is equivalent to supplying the
following directive:

SECTIONS
{
    .text : { }
    GROUP ALIGN( align_value ) :
    {
        .data : { }
        .bss : { }
    }
}

where align_value is a machine dependent constant. The
GROUP command ensures that the two output sections, .data
and .bss, are allocated (i.e., "grouped") together. Bonding or
alignment information is supplied only for the group and not
for the output sections contained within the group. The
sections making up the group are allocated in the order listed
in the directive.

If .text, .data, and .bss are to be placed in the same segment, the 
following SECTIONS directive is used: 

SECTIONS 

{ 

GROUP : 

{ 

.text : { } 

.data : { } 

.bss : { } 

} 
} 

Note that there are still three output sections (.text, .data, and 
.bss), but now they are allocated into consecutive virtual 
memory. 

This entire group of output sections could be bound to a
starting address or aligned simply by adding a field to the
GROUP directive. To bind to 0xC0000, use

GROUP 0xC0000 : {

To align to 0x10000, use

GROUP ALIGN(0x10000) : {

With this addition, first the output section .text is bound at
0xC0000 (or is aligned to 0x10000); then the remaining
members of the group are allocated in order of their
appearance into the next available memory locations.

When the GROUP directive is not used, each output section is 
treated as an independent entity: 

SECTIONS 

{ 

.text : { } 

.data ALIGN(0x20000) : { } 

.bss : { } 

} 

The .text section starts at virtual address 0x0 and the .data 
section at a virtual address aligned to 0x20000. The .bss section 
follows immediately after the .text section if there is enough 
space. If there is not, it follows the .data section. 

The order in which output sections are defined to ld cannot
be used to force a certain allocation order in the output file.

Creating Holes Within Output Sections 

The special symbol dot (.) appears only within section
definitions and assignment statements. When it appears on the
left side of an assignment statement, "." causes ld's
location counter to be incremented or reset and a "hole" left in
the output section. "Holes" built into output sections in this
manner take up physical space in the output file and are
initialized using a fill character (either the default fill
character (0x00) or a supplied fill character). See the definition
of the "-f" option in "USING THE LINK EDITOR" and the
discussion of filling holes in "Initialized Section Holes or BSS
Sections" under "LINK EDITOR COMMAND LANGUAGE".

Consider the following section definition: 



outsec:
{
    . += 0x1000;
    f1.o (.text)
    . += 0x100;
    f2.o (.text)
    . = align (4);
    f3.o (.text)
}



The effect of this command is as follows: 



a. A 0x1000 byte hole, filled with the default fill
character, is left at the beginning of the section.
Input file f1.o(.text) is linked after this hole.

b. The text of input file f2.o begins at 0x100 bytes
following the end of f1.o(.text).

c. The text of f3.o is linked to start at the next full
word boundary following the text of f2.o with
respect to the beginning of "outsec".
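
The effect of these three assignments on the location counter can
be traced with a short C sketch. It is an illustration only: the
structure and function names are hypothetical, the three size
arguments stand for the sizes of the .text sections of f1.o, f2.o,
and f3.o, and all offsets are relative to the beginning of
"outsec".

```c
#include <assert.h>

/* Offsets, relative to the start of "outsec", of each input section. */
struct layout {
    unsigned long f1, f2, f3, end;
};

/* Round dot up to a multiple of n (n a nonzero power of 2). */
static unsigned long round_up(unsigned long dot, unsigned long n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return (dot + n - 1) & ~(n - 1);
}

/* Walk the section definition statement by statement. */
struct layout outsec_layout(unsigned long size1, unsigned long size2,
                            unsigned long size3)
{
    struct layout lay;
    unsigned long dot = 0;      /* location counter within outsec */

    dot += 0x1000;              /* . += 0x1000;   leaves a hole   */
    lay.f1 = dot;               /* f1.o (.text)                   */
    dot += size1;
    dot += 0x100;               /* . += 0x100;    leaves a hole   */
    lay.f2 = dot;               /* f2.o (.text)                   */
    dot += size2;
    dot = round_up(dot, 4);     /* . = align (4);                 */
    lay.f3 = dot;               /* f3.o (.text)                   */
    dot += size3;
    lay.end = dot;
    return lay;
}
```

With hypothetical sizes of 0x22, 0x13, and 0x8 bytes, the three
input sections start at offsets 0x1000, 0x1122, and 0x1138.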

For the purposes of allocating and aligning addresses within an
output section, ld treats the output section as if it began at
address zero. As a result, if, in the above example, "outsec"
ultimately is linked to start at an odd address, then the part of
"outsec" built from f3.o(.text) also starts at an odd address,
even though f3.o(.text) is aligned to a full word boundary. This
is prevented by specifying an alignment factor for the entire
output section.

outsec ALIGN(4) : { 

It should be noted that the assembler, as, always pads the 
sections it generates to a full word length making explicit 
alignment specifications unnecessary. This also holds true for 
the compiler. 

Expressions that decrement "." are illegal. For example, 
subtracting a value from the location counter is not allowed 
since overwrites are not allowed. The most common operators 
in expressions that assign a value to "." are "+=" and "align". 



Creating and Defining Symbols at Link-Edit Time 

The assignment instruction of ld can be used to give
symbols a value that is link-edit dependent. Typically, there
are three types of assignments:

a. Use of "." to adjust ld's location counter during allocation

b. Use of "." to assign an allocation-dependent value to a
symbol

c. Assigning an allocation-independent value to a symbol.

Case a) has already been discussed in the previous section.

Case b) provides a means to assign addresses (known only after
allocation) to symbols. For example

SECTIONS
{
    outsc1: {...}
    outsc2:
    {
        file1.o (s1)
        s2_start = . ;
        file2.o (s2)
        s2_end = . - 1;
    }
}



The symbol "s2_start" is defined to be the address of file2.o(s2),
and "s2_end" is the address of the last byte of file2.o(s2).

Consider the following example: 

SECTIONS
{
    outsc1:
    {
        file1.o (.data)
        mark = .;
        . += 4;
        file2.o (.data)
    }
}

In this example, the symbol "mark" is created and is equal to
the address of the first byte beyond the end of file1.o's .data
section. Four bytes are reserved for a future run-time
initialization of the symbol mark. The type of the symbol is a
long integer (32 bits).

Assignment instructions involving "." must appear within
SECTIONS definitions since they are evaluated during
allocation. Assignment instructions that do not involve "." can
appear within SECTIONS definitions but typically do not. Such
instructions are evaluated after allocation is complete.

Reassignment of a defined symbol to a different address is
dangerous. For example, if a symbol within .data is defined,
initialized, and referenced within a set of object files being
link-edited, the symbol table entry for that symbol is changed
to reflect the new, reassigned physical address. However, the
associated initialized data is not moved to the new address.
ld issues warning messages for each defined symbol that is
being redefined within an ifile. However, assignments of
absolute values to new symbols are safe because there are no
references or initialized data associated with the symbol.



Allocating a Section Into Named Memory 

It is possible to specify that a section be linked (somewhere) 
within a specific named memory (as previously specified on a 
MEMORY directive). (The ">" notation is borrowed from the 
UNIX system concept of "redirected output".) 



For example 

MEMORY
{
    mem1:      o=0x000000 l=0x10000
    mem2 (RW): o=0x020000 l=0x40000
    mem3 (RW): o=0x070000 l=0x40000
    mem1:      o=0x120000 l=0x04000
}

SECTIONS
{
    outsec1: { f1.o(.data) } > mem1
    outsec2: { f2.o(.data) } > mem3
}

This directs ld to place "outsec1" anywhere within the memory
area named "mem1" (i.e., somewhere within the address range
0x0-0xFFFF or 0x120000-0x123FFF). The "outsec2" is to be
placed somewhere in the address range 0x70000-0xAFFFF.



Initialized Section Holes or BSS Sections 

When "holes" are created within a section (as in the example in
"LINK EDITOR COMMAND LANGUAGE"), ld normally
puts out bytes of zero as "fill". By default, .bss sections are not
initialized at all; that is, no initialized data is generated for any
.bss section by the assembler nor supplied by the link editor,
not even zeros.



Initialization options can be used in a SECTIONS directive to
set such "holes" or output .bss sections to an arbitrary 2-byte
pattern. Such initialization options apply only to .bss sections
or "holes". As an example, an application might want an 
uninitialized data table to be initialized to a constant value 
without recompiling the ".o" file or a "hole" in the text area to 
be filled with a transfer to an error routine. 

Either specific areas within an output section or the entire 
output section may be specified as being initialized. However, 
since no text is generated for an uninitialized .bss section, if 
part of such a section is initialized, then the entire section is 
initialized. In other words, if a .bss section is to be combined 
with a .text or .data section (both of which are initialized) or if 
part of an output .bss section is to be initialized, then one of the 
following will hold: 

a. Explicit initialization options must be used to
initialize all .bss sections in the output section, or

b. ld will use the default fill value to initialize all
.bss sections in the output section.

Consider the following ld ifile:

SECTIONS
{
    sec1:
    {
        f1.o
        . += 0x200;
        f2.o (.text)
    } = 0xDFFF
    sec2:
    {
        f1.o (.bss)
        f2.o (.bss) = 0x1234
    }
    sec3:
    {
        f3.o (.bss)
    } = 0xFFFF
    sec4: { f4.o (.bss) }
}


In the example above, the 0x200 byte "hole" in section "sec1" is
filled with the value 0xDFFF. In section "sec2", f1.o(.bss) is
initialized to the default fill value of 0x00, and f2.o(.bss) is
initialized to 0x1234. All .bss sections within "sec3" as well as
all "holes" are initialized to 0xFFFF. Section "sec4" is not
initialized; that is, no data is written to the object file for this
section.
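
Filling a hole with a 2-byte pattern can be pictured with the
following C sketch. It is not ld's implementation. The function
name fill_hole is hypothetical, and the sketch assumes that the
high-order byte of the pattern is written first; the actual byte
order depends on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill len bytes at buf with a repeating
 * 2-byte pattern. ASSUMPTION: the high-order byte comes first.
 */
void fill_hole(unsigned char *buf, size_t len, unsigned int pattern)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (i & 1)
            buf[i] = (unsigned char)(pattern & 0xFF);        /* low  */
        else
            buf[i] = (unsigned char)((pattern >> 8) & 0xFF); /* high */
    }
}
```

Under that assumption, a hole filled with 0xDFFF holds the byte
sequence 0xDF, 0xFF, 0xDF, 0xFF, and so on.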

NOTES AND SPECIAL CONSIDERATIONS



Changing the Entry Point 

The a.out header contains a field for the (primary) entry point 
of the file. This field is set using one of the following rules 
(listed in the order they are applied): 



a. The value of the symbol specified with the "-e" 
option, if present, is used. 

b. The value of the symbol "_start", if present, is 
used. 

c. The value of the symbol "main", if present, is used. 

d. The value zero is used. 
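
The order in which these rules are applied can be sketched in C.
This is an illustration only, not ld source: the symbol table
layout and the names struct sym, lookup, and entry_point are
hypothetical. The sketch also assumes that a symbol named with
"-e" but not defined anywhere simply falls through to the next
rule; the text above does not say what ld does in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sym {                    /* hypothetical symbol table entry */
    const char   *name;
    unsigned long value;
};

/* Return the entry called name, or NULL if there is none. */
static const struct sym *lookup(const struct sym *tab, size_t n,
                                const char *name)
{
    size_t i;

    if (name == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* e_option is the symbol given with "-e", or NULL if "-e" was not used. */
unsigned long entry_point(const struct sym *tab, size_t n,
                          const char *e_option)
{
    const struct sym *s;

    assert(tab != NULL || n == 0);
    if ((s = lookup(tab, n, e_option)) != NULL)   /* rule a */
        return s->value;
    if ((s = lookup(tab, n, "_start")) != NULL)   /* rule b */
        return s->value;
    if ((s = lookup(tab, n, "main")) != NULL)     /* rule c */
        return s->value;
    return 0;                                     /* rule d */
}
```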

Thus, an explicit entry point can be assigned to this a.out 
header field through the "-e" option or by using an assignment 
instruction in an ifile of the form 

_start = expression; 

If ld is called through cc(1), a startup routine is
automatically linked in. Then, when the program is executed,
the routine exit(2) is called after the main routine finishes to
close file descriptors and do other cleanup. The user must
therefore be careful when calling ld directly or when
changing the entry point. The user must supply the startup
routine or make sure that the program always calls exit rather
than falling through the end. Otherwise, the program will dump
core.

Use of Archive Libraries

Each member of an archive library (e.g., libc.a) is a complete
object file typically consisting of the standard three sections:
.text, .data, and .bss. Archive libraries are created through the
use of the UNIX system "ar" command from object files
generated by running cc or as.

An archive library is always processed using selective inclusion: 
Only those members that resolve existing undefined-symbol 
references are taken from the library for link editing. 

Libraries can be placed both inside and outside section 
definitions. In both cases, a member of a library is included for 
linking whenever 

a. There exists a reference to a symbol defined in 
that member. 

b. The reference is found by ld prior to the actual
scanning of the library.

When a library member is included by searching the library 
inside a SECTIONS directive, all input sections from the 
library member are included in the output section being 
defined. When a library member is included by searching the 
library outside of a SECTIONS directive, all input sections from 
the library member are included into the output section with 
the same name. That is, the .text section of the member goes 
into the output section named .text, the .data section of the 
member into .data, the .bss section of the member into .bss, etc. 
If necessary, new output sections are defined to provide a place 
to put the input sections. Note, however, that 

a. Specific members of a library cannot be referenced
explicitly in an ifile.

b. The default rules for the placement of members
and sections cannot be overridden when they apply
to archive library members.

The "-l" option is a shorthand notation for specifying an input
file coming from a predefined set of directories and having a
predefined name. By convention, such files are archive
libraries. However, they need not be so. Furthermore, archive
libraries can be specified without using the "-l" option by
simply giving the (full or relative) UNIX system file path.

The ordering of archive libraries is important since for a
member to be extracted from the library it must satisfy a
reference that is known to be unresolved at the time the library
is searched. Archive libraries can be specified more than once.
They are searched every time they are encountered. Archive
files have a symbol table at the beginning of the archive.
ld cycles through this symbol table until it has determined
that it cannot resolve any more references from that library.

Consider the following example: 

a. The input files file1.o and file2.o each contain a
reference to the external function FCN.

b. Input file1.o contains a reference to symbol ABC.

c. Input file2.o contains a reference to symbol XYZ.

d. Library liba.a, member 0, contains a definition of
XYZ.

e. Library libc.a, member 0, contains a definition of
ABC.

f. Both libraries have a member 1 that defines FCN.

If the ld command were entered as

ld file1.o -la file2.o -lc

then the FCN references are satisfied by liba.a, member 1, ABC
is obtained from libc.a, member 0, and XYZ remains undefined
(since the library liba.a is searched before file2.o is specified).
If the ld command were entered as

ld file1.o file2.o -la -lc

then the FCN references are satisfied by liba.a, member 1, ABC
is obtained from libc.a, member 0, and XYZ is obtained from
liba.a, member 0. If the ld command were entered as

ld file1.o file2.o -lc -la

then the FCN references are satisfied by libc.a, member 1, ABC
is obtained from libc.a, member 0, and XYZ is obtained from
liba.a, member 0.
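
The three outcomes above follow from processing the command line
from left to right. The following C sketch models that behavior
for this example only; it is not ld's algorithm as implemented.
All structure and function names are hypothetical, symbols are
represented as bits in a mask, and an object file is modeled as
an input with a single member that is always included.

```c
#include <assert.h>

#define NSYM   3                        /* symbols in the example */
#define MAXMEM 4                        /* most members per input */

enum { FCN = 1, ABC = 2, XYZ = 4 };     /* one bit per symbol */

struct member { unsigned defs, refs; }; /* bit sets of symbols */

struct input {
    int is_lib;                 /* 0: object file, 1: archive library */
    int nmem;                   /* an object file has one "member"    */
    struct member mem[MAXMEM];
};

/* The inputs of the example above. */
const struct input file1_o = { 0, 1, { { 0, FCN | ABC } } };
const struct input file2_o = { 0, 1, { { 0, FCN | XYZ } } };
const struct input lib_a   = { 1, 2, { { XYZ, 0 }, { FCN, 0 } } };
const struct input lib_c   = { 1, 2, { { ABC, 0 }, { FCN, 0 } } };

/*
 * Process the inputs in command-line order. provider[s] receives the
 * position of the input that defined symbol bit s, or -1 if none did.
 * Returns the set of symbols that remain undefined.
 */
unsigned link_run(const struct input *in[], int n, int provider[NSYM])
{
    unsigned defined = 0, undefined = 0;
    int i, m, s, progress;

    assert(in != 0 && provider != 0);
    for (s = 0; s < NSYM; s++)
        provider[s] = -1;

    for (i = 0; i < n; i++) {
        int taken[MAXMEM] = { 0 };

        do {    /* a library is rescanned until nothing more resolves */
            progress = 0;
            for (m = 0; m < in[i]->nmem; m++) {
                const struct member *mb = &in[i]->mem[m];

                if (taken[m])
                    continue;
                /* selective inclusion: skip members not needed now */
                if (in[i]->is_lib && !(mb->defs & undefined))
                    continue;
                taken[m] = 1;
                progress = 1;
                for (s = 0; s < NSYM; s++)
                    if ((mb->defs & (1u << s)) && provider[s] < 0)
                        provider[s] = i;
                defined |= mb->defs;
                undefined = (undefined | mb->refs) & ~defined;
            }
        } while (in[i]->is_lib && progress);
    }
    return undefined;
}
```

Passing the four inputs in the order of the first command line
leaves XYZ in the returned set, because liba.a is scanned before
file2.o has introduced the reference to XYZ.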

The "-u" option is used to force the linking of library members
when the link edit run does not contain an actual external
reference to the members. For example,

ld -u rout1 -la

creates an undefined symbol called "rout1" in ld's global
symbol table. If any member of library liba.a defines this
symbol, it (and perhaps other members as well) is extracted.
Without the "-u" option, there would have been no "trigger" to
cause ld to search the archive library.

Dealing With Holes in Physical Memory

When memory configurations are defined such that 
unconfigured areas exist in the virtual memory, each 
application or user must assume the responsibility of forming 
output sections that will fit into memory. For example, assume 
that memory is configured as follows: 



MEMORY
{
    mem1: o = 0x00000 l = 0x02000
    mem2: o = 0x40000 l = 0x05000
    mem3: o = 0x20000 l = 0x10000
}



Let the files f1.o, f2.o, . . . fn.o each contain the standard three
sections .text, .data, and .bss, and suppose the combined .text
section is 0x12000 bytes. There is no configured area of
memory in which this section can be placed. Appropriate
directives must be supplied to break up the .text output section
so ld may do allocation. For example,



SECTIONS
{
    txt1:
    {
        f1.o (.text)
        f2.o (.text)
        f3.o (.text)
    }
    txt2:
    {
        f4.o (.text)
        f5.o (.text)
        f6.o (.text)
    }
    etc.
}

Allocation Algorithm

An output section is formed either as a result of a SECTIONS
directive or by combining input sections of the same name. An
output section can have zero or more input sections comprising
it. After the composition of an output section is determined, it
must then be allocated into configured virtual memory. ld uses
an algorithm that attempts to minimize fragmentation of
memory, and hence increases the possibility that a link edit run
will be able to allocate all output sections within the specified
virtual memory configuration. The algorithm proceeds as
follows:

a. Any output sections for which explicit bonding 
addresses were specified are allocated. 

b. Any output sections to be included in a specific 
named memory are allocated. In both this and the 
succeeding step, each output section is placed into 
the first available space within the (named) 
memory with any alignment taken into 
consideration. 

c. Output sections not handled by one of the above 
steps are allocated. 

If all memory is contiguous and configured (the default case),
and no SECTIONS directives are given, then output sections are
allocated in the order they appear to ld, normally .text,
.data, .bss. Otherwise, output sections are allocated in the order
they were defined or made known to ld into the first
available space they fit.
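
The three steps can be sketched in C as a first-fit allocator.
This is a simplified model under stated assumptions, not ld's
implementation: every name in it is hypothetical, alignment is
ignored, and each named memory is taken to be a single range
identified by its index in the array of configured ranges.

```c
#include <assert.h>
#include <stddef.h>

#define NOADDR (~0UL)           /* "no address" marker */

struct range { unsigned long org, len; };   /* configured memory */

struct outsec {
    unsigned long size;
    unsigned long bond;         /* bonding address, or NOADDR       */
    int           mem;          /* index of named memory, or -1     */
    unsigned long addr;         /* result: NOADDR if not allocated  */
};

/* Does [a, a+size) collide with an allocated section? If so, report
 * the end of that section through next. */
static int overlaps(const struct outsec *s, size_t n, unsigned long a,
                    unsigned long size, unsigned long *next)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (s[i].addr == NOADDR)
            continue;
        if (s[i].addr < a + size && s[i].addr + s[i].size > a) {
            *next = s[i].addr + s[i].size;
            return 1;
        }
    }
    return 0;
}

/* Lowest free address in range r that holds size bytes, or NOADDR. */
static unsigned long first_fit(const struct outsec *s, size_t n,
                               const struct range *r, unsigned long size)
{
    unsigned long a = r->org, next;

    while (a + size <= r->org + r->len) {
        if (!overlaps(s, n, a, size, &next))
            return a;
        a = next;
    }
    return NOADDR;
}

/* Returns the number of sections that could not be allocated. */
int allocate(struct outsec *s, size_t ns, const struct range *mem,
             size_t nm)
{
    size_t i, m;
    unsigned long unused;
    int failed = 0;

    assert(s != NULL && mem != NULL);
    for (i = 0; i < ns; i++)
        s[i].addr = NOADDR;

    for (i = 0; i < ns; i++) {          /* step a: bonding addresses */
        if (s[i].bond == NOADDR)
            continue;
        for (m = 0; m < nm && s[i].addr == NOADDR; m++)
            if (s[i].bond >= mem[m].org &&
                s[i].bond + s[i].size <= mem[m].org + mem[m].len &&
                !overlaps(s, ns, s[i].bond, s[i].size, &unused))
                s[i].addr = s[i].bond;
    }
    for (i = 0; i < ns; i++)            /* step b: named memory */
        if (s[i].bond == NOADDR && s[i].mem >= 0)
            s[i].addr = first_fit(s, ns, &mem[s[i].mem], s[i].size);

    for (i = 0; i < ns; i++) {          /* step c: everything else */
        if (s[i].bond != NOADDR || s[i].mem >= 0)
            continue;
        for (m = 0; m < nm && s[i].addr == NOADDR; m++)
            s[i].addr = first_fit(s, ns, &mem[m], s[i].size);
    }
    for (i = 0; i < ns; i++)
        if (s[i].addr == NOADDR)
            failed++;
    return failed;
}
```

A section that fits in no configured range, such as a 0x12000
byte section in the memory configuration of "Dealing With Holes
in Physical Memory", is left unallocated and counted as a
failure.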

Incremental Link Editing

As previously mentioned, the output of ld can be used as an
input file to subsequent ld runs providing that the relocation
information is retained ("-r" option). Large applications may
find it desirable to partition their C programs into
"subsystems", link each subsystem independently, and then link
edit the entire application. For example,

Step 1:

ld -r -o outfile1 ifile1

/* ifile1 */
SECTIONS
{
    ss1:
    {
        f1.o
        f2.o
        ...
        fn.o
    }
}

Step 2:

ld -r -o outfile2 ifile2

/* ifile2 */
SECTIONS
{
    ss2:
    {
        g1.o
        g2.o
        ...
        gn.o
    }
}

Step 3:

ld -a -m -o final.out outfile1 outfile2



By judiciously forming subsystems, applications may achieve a
form of "incremental link editing" whereby it is necessary to
relink only a portion of the total link edit when a few programs
are recompiled.

To apply this technique, there are two simple rules:

a. Intermediate link edits should contain only
SECTIONS declarations and be concerned only
with the formation of output sections from input
files and input sections. No binding of output
sections should be done in these runs.

b. All allocation and memory directives, as well as
any assignment statements, are included only in
the final ld call.



DSECT, COPY, and NOLOAD Sections 

Sections may be given a "type" in a section definition as shown 
in the following example: 



SECTIONS 

{ 

name1 0x200000 (DSECT) : { file1.o }

name2 0x400000 (COPY) : { file2.o }
name3 0x600000 (NOLOAD) : { file3.o }

} 

The DSECT option creates what is called a "dummy section".
A "dummy section" has the following properties:

a. It does not participate in the memory allocation
for output sections. As a result, it takes up no
memory and does not show up in the memory map
(the "-m" option) generated by ld.

b. It may overlay other output sections and even 
unconfigured memory. DSECTs may overlay other 
DSECTs. 

c. The global symbols defined within the "dummy 
section" are relocated normally. That is, they 
appear in the output file's symbol table with the 
same value they would have had if the DSECT 
were actually loaded at its virtual address. 
DSECT-defined symbols may be referenced by 
other input sections. Undefined external symbols 
found within a DSECT cause specified archive 
libraries to be searched and any members which 
define such symbols are link edited normally (i.e., 
not in the DSECT or as a DSECT). 

d. None of the section contents, relocation 
information, or line number information 
associated with the section is written to the output 
file. 

In the above example, none of the sections from file1.o are
allocated, but all symbols are relocated as though the sections
were link edited at the specified address. Other sections could
refer to any of the global symbols and they are resolved
correctly.

A "copy section" created by the COPY option is similar to a
"dummy section". The only difference between a "copy section"
and a "dummy section" is that the contents of a "copy section"
and all associated information are written to the output file.

A section with the "type" of NOLOAD differs in only one
respect from a normal output section: its text and/or data is not
written to the output file. A NOLOAD section is allocated
virtual space, appears in the memory map, etc.



Output File Blocking 

The BLOCK option (applied to any output section or GROUP
directive) is used to direct ld to align a section at a specified
byte offset in the output file. It has no effect on the address at
which the section is allocated nor on any part of the link edit
process. It is used purely to adjust the physical position of the
section in the output file.

SECTIONS 

{ 

.text BLOCK(0x200) : { } 

.data ALIGN(0x20000) BLOCK(0x200) : { } 

} 

With this SECTIONS directive, ld assures that each section,
.text and .data, is physically written at a file offset which is a
multiple of 0x200 (e.g., at an offset of 0, 0x200, 0x400, etc. in
the file).
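
The effect of BLOCK on file offsets alone can be illustrated with
the following C sketch. The structure and function names are
hypothetical, the starting offset stands for whatever headers
precede the first section in the output file, and the sketch
assumes sections are written one after another with no other
padding.

```c
#include <assert.h>
#include <stddef.h>

struct filesec {
    unsigned long size;     /* bytes written for the section        */
    unsigned long blk;      /* BLOCK value, or 0 if none was given  */
    unsigned long fileoff;  /* result: offset in the output file    */
};

/* Assign file offsets in order, starting at offset start. */
void place_in_file(struct filesec *s, size_t n, unsigned long start)
{
    unsigned long off = start;
    size_t i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (s[i].blk != 0)  /* round up to a multiple of blk */
            off = ((off + s[i].blk - 1) / s[i].blk) * s[i].blk;
        s[i].fileoff = off;
        off += s[i].size;
    }
}
```

Only the fileoff values change when a BLOCK value is supplied;
nothing in the sketch touches a virtual address, in keeping with
the description above.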



Nonrelocatable Input Files 

If a file produced by ld is intended to be used in a
subsequent ld run, the first ld run has the "-r" option set. This
preserves relocation information and permits the sections of the
file to be relocated by the subsequent ld run.

When ld detects an input file that does not have relocation
or symbol table information, a warning message is given. Such
information can be removed by ld (see the "-a" and "-s"
options in the part USING THE LINK EDITOR) or by the
strip(1) program. However, the link edit run continues using
the nonrelocatable input file.

For such a link edit to be successful (i.e., to actually and
correctly link edit all input files, relocate all symbols, resolve
unresolved references, etc.), two conditions on the
nonrelocatable input files must be met.

a. Each input file must have no unresolved external 
references. 

b. Each input file must be bound to the exact same 
virtual address as it was bound to in the Id run 
that created it. 

Note that if these two conditions are not met for all
nonrelocatable input files, no error messages are issued.
Because of this fact, extreme care must be taken when
supplying such input files to ld.



ERROR MESSAGES 

Corrupt Input Files 

The following error messages indicate that the input file is 
corrupt, nonexistent, or unreadable. The user should check that 
the file is in the correct directory with the correct permissions. 
If the object file is corrupt, try recompiling or reassembling it. 

• Can't open name

• Can't read archive header from archive name

• Can't read file header of archive name

• Can't read 1st word of file name

• Can't seek to the beginning of file name

• Fail to read file header of name

• Fail to read lnno of section sect of file name

• Fail to read magic number of file name

• Fail to read section headers of file name

• Fail to read section headers of library name member
number

• Fail to read symbol table of file name

• Fail to read symbol table when searching libraries

• Fail to read the aux entry of file name

• Fail to read the field to be relocated

• Fail to seek to symbol table of file name

• Fail to seek to symbol table when searching libraries

• Fail to seek to the end of library name member number

• Fail to skip aux entries when searching libraries

• Fail to skip the mem of struct of name

• Illegal relocation type

• No reloc entry found for symbol

• Reloc entries out of order in section sect of file name

• Seek to name section sect failed

• Seek to name section sect lnno failed

• Seek to name section sect reloc entries failed

• Seek to relocation entries for section sect in file name
failed.



Errors During Output 

These errors occur because ld cannot write to the output
file. This usually indicates that the file system is out of space.

• Cannot complete output file name. Write error.

• Fail to copy the rest of section num of file name

• Fail to copy the bytes that need no reloc of section num of
file name

• I/O error on output file name.

Internal Errors 

These messages indicate that something is wrong with ld
internally. There is probably nothing the user can do except get
help.

• Attempt to free nonallocated memory

• Attempt to reinitialize the SDP aux space

• Attempt to reinitialize the SDP slot space

• Default allocation did not put .data and .bss into the same
region

• Failed to close SDP symbol space

• Failure dumping an AIDFNxxx data structure

• Failure in closing SDP aux space

• Failure to initialize the SDP aux space

• Failure to initialize the SDP slot space

• Internal error: audit_groups, address mismatch

• Internal error: audit_group, finds a node failure

• Internal error: fail to seek to the member of name

• Internal error: in allocate lists, list confusion (num num)

• Internal error: invalid aux table id

• Internal error: invalid symbol table id

• Internal error: negative aux table id

• Internal error: negative symbol table id

• Internal error: no symtab entry for DOT

• Internal error: split_scns, size of sect exceeds its new
displacement.

Allocation Errors 

These error messages appear during the allocation phase of the
link edit. They generally appear if a section or group does not
fit at a certain address or if the given MEMORY or SECTIONS
directives in some way conflict. If you are using an ifile, check
that the MEMORY and SECTIONS directives allow enough room
for the sections, so that nothing overlaps and nothing is
placed in unconfigured memory. For more information,
see "LINK EDITOR COMMAND LANGUAGE" and "NOTES
AND SPECIAL CONSIDERATIONS".

• Bond address address for sect is not in configured memory

• Bond address address for sect overlays previously allocated 
section sect at address 

• Can't allocate output section sect, of size num 

• Can't allocate section sect into owner mem 

• Default allocation failed: name is too large 

• GROUP containing section sect is too big 

• Memory types name1 and name2 overlap

• Output section sect not allocated into a region 

• Sect at address overlays previously allocated section sect at 
address 

• Sect, bonded at address, won't fit into configured memory 

• Sect enters unconfigured memory at address 

• Section sect in file name is too big. 

Misuse of Link Editor Directives 

These errors arise from the misuse of an input directive. Please 
review the appropriate section in the manual. 

• Adding name(sect) to multiple output sections. 

The input section is mentioned twice in the SECTIONS directive.

• Bad attribute value in MEMORY directive: c.

An attribute must be one of "R", "W", "X", or "I".

• Bad flag value in SECTIONS directive, option.

Only the "-l" option is allowed inside of a SECTIONS directive.

• Bad fill value. 

The fill value must be a 2-byte constant. 

• Bonding excludes alignment. 

The section will be bound at the given address regardless of the 
alignment of that address. 

• Cannot align a section within a group 

• Cannot bond a section within a group 

• Cannot specify an owner for sections within a group. 

The entire group is treated as one unit, so the group may be 
aligned or bound to an address, but the sections making up the 
group may not be handled individually. 

• DSECT sect can't be given an owner 

• DSECT sect can't be linked to an attribute. 

Since dummy sections do not participate in the memory 
allocation, it is meaningless for a dummy section to be given an 
owner or an attribute. 

• Region commands not allowed

The UNIX system link editor does not accept the REGION
commands.



• Section sect not built. 

The most likely cause of this is a syntax error in the 
SECTIONS directive. 

• Semicolon required after expression 

• Statement ignored. 

Caused by a syntax error in an expression. 

• Usage of unimplemented syntax. 

The UNIX system ld does not accept all possible ld commands.

Misuse of Expressions 

These errors arise from the misuse of an input expression. 
Please review the appropriate section in the manual. 

• Absolute symbol name being redefined. 
An absolute symbol may not be redefined. 

• ALIGN illegal in this context. 

Alignment of a symbol may only be done within a SECTIONS 
directive. 

• Attempt to decrement DOT

• Illegal assignment of physical address to DOT.

• Illegal operator in expression

• Misuse of DOT symbol in assignment instruction. 

The DOT symbol (".") cannot be used in assignment statements 
that are outside SECTIONS directives. 

• Symbol name is undefined. 

All symbols referenced in an assignment statement must be 
defined. 

• Symbol name from file name being redefined. 

A defined symbol may not be redefined in an assignment 
statement. 

• Undefined symbol in expression. 

Misuse of Options 

These errors arise from the misuse of options. Please review 
the appropriate section of the manual. 

• Both -r and -s flags are set. The -s flag is turned off. 
Further relocation requires a symbol table. 

• Can't find library libx.a

• -L path too long (string)

• -o file name too large (>128 char), truncated to (string)

• Too many -L options, seven allowed.

Some options require white space before the argument, some do 
not; see "USING THE LINK EDITOR". Including extra white 
space or not including the required white space is the most 
likely cause of the following messages. 

• option flag does not specify a number 

• option is an invalid flag 

• -e flag does not specify a legal symbol name name 

• -f flag does not specify a 2-byte number 

• No directory given with -L 

• -o flag does not specify a valid file name: string

• the -l flag (specifying a default library) is not supported

• -u flag does not specify a legal symbol name: name. 

Space Restraints 

The following error messages may occur if ld attempts to
allocate more space than is available. The user should attempt
to decrease the amount of space used by ld. This may be
accomplished by making the ifile less complicated or by using
the "-r" option to create intermediate files.

• Fail to allocate num bytes for slotvec table 

• Internal error: aux table overflow

• Internal error: symbol table overflow

• Memory allocation failure on num-byte 'calloc' call

• Memory allocation failure on realloc call 

• Run is too large and complex. 

Miscellaneous Errors 

These errors occur for many reasons. Refer to the error 
message for an indication of where to look in the manual. 

• Archive symbol table is empty in archive name, execute 'ar
ts name' to restore archive symbol table.

On systems with a random access archive capability, the link 
editor requires that all archives have a symbol table. This 
symbol table may have been removed by strip. 

• Cannot create output file name . 

The user may not have write permission in the directory where 
the output file is to be written. 

• File name has no relocation information. 

See "NOTES AND SPECIAL CONSIDERATIONS." 

• File name is of unknown type, magic number = num 

• Ifile nesting limit exceeded with file name. 

Ifiles may be nested 16 deep.

• Library name, member has no relocation information.

• Line nbr entry (num num) found for nonrelocatable
symbol. Section sect, file name

This is generally caused by an interaction of yacc(1) and cc(1).
Re-yacc the offending file with the "-l" option of yacc.
See the part "NOTES AND SPECIAL CONSIDERATIONS".

• Multiply defined symbol sym, in name has more than one 
size. 



A multiply defined symbol may not have been defined in the 
same manner in all files. 



• name(sect) not found.

An input section specified in a SECTIONS directive was not
found in the input file.

• Section sect starts on an odd byte boundary!

This will happen only if the user specifically binds a section at
an odd boundary.

• Sections .text, .data, or .bss not found. Optional header may
be useless.

The UNIX system a.out header uses values found in the .text,
.data, and .bss section headers.

• Undefined symbol sym first referenced in file name.

Unless the -r option is used, ld requires that all referenced
symbols are defined.

• Unexpected EOF (End Of File).

Syntax error in the ifile.



SYNTAX DIAGRAM FOR INPUT 
DIRECTIVES 

A syntax diagram for input directives is found in Figure 17-2.

Directives          Expanded Directives

<file>           -> { <cmd> }
<cmd>            -> <memory>
                 -> <sections>
                 -> <assignment>
                 -> <filename>
                 -> <flags>
<memory>         -> MEMORY { <memory_spec>
                    { [,] <memory_spec> } }
<memory_spec>    -> <name> [ <attributes> ] :
                    <origin_spec> [,] <length_spec>
<attributes>     -> ( { R | W | X | I } )
<origin_spec>    -> <origin> = <long>
<length_spec>    -> <length> = <long>
<origin>         -> ORIGIN | o | org | origin
<length>         -> LENGTH | l | len | length
<sections>       -> SECTIONS { { <sec_or_group> } }
<sec_or_group>   -> <section> | <group> | <library>
<group>          -> GROUP <group_options> : {
                    <section_list> } [ <mem_spec> ]
<section_list>   -> <section> { [,] <section> }
<section>        -> <name> <sec_options> : {
                    <statement_list> }
                    [ <fill> ] [ <mem_spec> ]
<group_options>  -> [ <addr> ] [ <align_option> ]
<sec_options>    -> [ <addr> ] [ <align_option> ]
                    [ <block_option> ] [ <type_option> ]
<addr>           -> <long>
<align_option>   -> <align> ( <long> )
<align>          -> ALIGN | align
<block_option>   -> <block> ( <long> )
<block>          -> BLOCK | block
<type_option>    -> (DSECT) | (NOLOAD) | (COPY)
<fill>           -> = <long>
<mem_spec>       -> > <name>
                 -> > <attributes>
<statement>      -> <file_name> [ ( <name_list> ) ]
                    [ <fill> ] <library> <assignment>
<name_list>      -> <name> { [,] <name> }
<library>        -> -l<name>
<assignment>     -> <lside> <assign_op> <expr> <end>
<lside>          -> <name> | .
<assign_op>      -> = | += | -= | *= | /=
<end>            -> ; | ,
<expr>           -> <expr> <binary_op> <expr>
                 -> <term>
<binary_op>      -> * | / | %
                 -> + | -
                 -> >> | <<
                 -> == | != | > | < | <= | >=
                 -> &
                 -> |
                 -> &&
                 -> ||
<term>           -> <long>
                 -> <name>
                 -> <align> ( <term> )
                 -> ( <expr> )
                 -> <unary_op> <term>
<unary_op>       -> ! | -
<flags>          -> -e<wht_space><name>
                 -> -f<wht_space><long>
                 -> -h<wht_space><long>
                 -> -l<name>
                 -> -m
                 -> -o<wht_space><filename>
                 -> -r
                 -> -s
                 -> -t
                 -> -u<wht_space><name>
                 -> -z
                 -> -H
                 -> -L<pathname>
                 -> -M
                 -> -N
                 -> -S
                 -> -V
                 -> -VS<wht_space><long>
                 -> -a
                 -> -X
<name>           -> Any valid symbol name
<long>           -> Any valid long integer constant
<wht_space>      -> Blanks, tabs, and newlines
<filename>       -> Any valid UNIX operating system
                    filename. This may include a
                    full or partial pathname.
<pathname>       -> Any valid UNIX operating system
                    pathname (full or partial)

Figure 17-2. Syntax Diagram for Input Directives



Chapter 18

THE COMMON OBJECT FILE
FORMAT



GENERAL 

This chapter describes the Common Object File Format (COFF) 
used on several processors and operating systems, including the 
AT&T Technologies 3B Computer family and the UNIX 
operating system. The COFF is simple enough to be easily 
incorporated into existing projects, yet flexible enough to meet 
the needs of most projects. The COFF is the output file 
produced on some UNIX systems by the assembler (as) and the
link editor (ld). This format is also used by other operating
systems; hence, the word common is both descriptive and widely
recognized. Currently, this object file format is used for the
AT&T UNIX PC, the AT&T Technologies 3B Computers, including
the 3B20D, the 3B20S, the 3B5, and the 3B2 Computers, and on the
VAX*-11/780 and 11/750 UNIX operating systems. Some key
features of COFF are

• Applications may add system-dependent information to the 
object file without causing access utilities to become 
obsolete 

• Space is provided for symbolic information used by 
debuggers and other applications 



• Users may make some modifications in the object file
construction at compile time.



* Trademark of Digital Equipment Corporation

The object file supports user-defined sections and contains
extensive information for symbolic software testing. An object 
file contains 

• A file header 

• Optional header information 

• A table of section headers 

• Data corresponding to the section header 

• Relocation information 

• Line numbers 

• A symbol table 

• A string table. 

Figure 18-1 shows the overall structure.

    FILE HEADER
    Optional Information
    Section 1 Header
    ...
    Section n Header
    Raw Data for Section 1
    ...
    Raw Data for Section n
    Relocation Info for Sect. 1
    ...
    Relocation Info for Sect. n
    Line Numbers for Sect. 1
    ...
    Line Numbers for Sect. n
    SYMBOL TABLE
    STRING TABLE

Figure 18-1. Object File Format



The last four sections (relocation, line numbers, symbol table,
and the string table) may be missing if the program is linked
with the -s option of the UNIX system link editor or if the line
number information, symbol table, and string table are
removed by the strip command. The line number information
does not appear unless the program is compiled with the -g
option of the compiler (cc) command. Also, if there are no
unresolved external references after linking, the relocation
information is no longer needed and is absent. The string table
is also absent if the source file does not contain any symbols
with names longer than eight characters.

An object file that contains no errors or unresolved references 
can be executed on the target machine. 



DEFINITIONS AND CONVENTIONS 

Before proceeding further, you should become familiar with the 
following terms and conventions: 



Sections 

A section is the smallest portion of an object file that is 
relocated and treated as one separate and distinct entity. In 
the default case, there are three sections named .text, .data, 
and .bss. Additional sections accommodate multiple text or 
data segments, shared data segments, or user-specified sections. 
However, the UNIX operating system loads only the .text, .data, 
and .bss into memory when the file is executed. 



Physical and Virtual Addresses 

The physical address of a section or symbol is the offset of that 
section or symbol from address zero of the address space. The 
term physical address as used in COFF does not correspond to 
the general usage. The physical address of an object is not 
necessarily the address at which the object is placed when the 
process is executed. For example, on a system with paging, the 
address is located with respect to address zero of virtual 
memory and the system performs another address translation. 
The section header contains two address fields, a physical
address and a virtual address; but in all versions of COFF on
UNIX systems, the physical address is equivalent to the virtual
address.



FILE HEADER

The file header contains the 20 bytes of information shown in
Figure 18-2. The last 2 bytes are flags that are used by ld and
object file utilities.



Bytes   Declaration      Name       Description

0-1     unsigned short   f_magic    Magic number, see Figure 18-3.
2-3     unsigned short   f_nscns    Number of section headers
                                    (equals the number of sections)
4-7     long int         f_timdat   Time and date stamp indicating
                                    when the file was created relative
                                    to the number of elapsed seconds
                                    since 00:00:00 GMT, January 1, 1970.
8-11    long int         f_symptr   File pointer containing the
                                    starting address of the symbol table
12-15   long int         f_nsyms    Number of entries in the symbol
                                    table
16-17   unsigned short   f_opthdr   Number of bytes in the optional
                                    header
18-19   unsigned short   f_flags    Flags (see Figure 18-4)

Figure 18-2. File Header Contents



The size of optional header information (f_opthdr) is used by 
all referencing programs that seek to the beginning of the 
section header table. This enables the same utility programs to 
work correctly on files targeted for different systems. 
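
The seek calculation described above can be sketched as follows. This is only an illustration: it assumes the 20-byte file header given in Figure 18-2 and the 40-byte section header described later in this chapter, and the names COFF_FILHSZ, COFF_SCNHSZ, and scnhdr_offset are hypothetical, not part of any system header file.

```c
/* Sizes in bytes, as given in this chapter (hypothetical macro names). */
#define COFF_FILHSZ 20L   /* file header */
#define COFF_SCNHSZ 40L   /* one section header */

/*
 * Return the byte offset, from the start of the object file, of
 * section header number `index` (counted from 0).  `f_opthdr` is the
 * optional-header size taken from the file header.
 */
long scnhdr_offset(unsigned short f_opthdr, unsigned int index)
{
    return COFF_FILHSZ + (long)f_opthdr + (long)index * COFF_SCNHSZ;
}
```

With a 28-byte optional header, for example, the first section header would start at byte offset 48 under these assumptions.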



Magic Numbers 

The magic number specifies the target machine on which the 
object file is executable. The currently defined magic numbers 
are in Figure 18-3.

Mnemonic        Magic Number   System

N3BMAGIC        0550           3B20S Computers *
FBOMAGIC        0560           3B2 and 3B5 Computers *
VAXWRMAGIC      0570           VAX-11/750 and VAX-11/780
                               (writable text segments)
VAXROMAGIC      0575           VAX-11/750 and VAX-11/780
                               (read-only text segments)
MC68KWRMAGIC    0520           Motorola (writable text segment)
MC68KROMAGIC    0521           Motorola (read-only sharable
                               text segment)
MC68KPGMAGIC    0522           Motorola (demand-paged text
                               segment)
U370WRMAGIC     0530           IBM 370 (writable text segments)
U370ROMAGIC     0535           IBM 370 (read-only sharable
                               text segments)

Figure 18-3. Magic Numbers

* Trademark of AT&T
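
A program that inspects object files might map the magic number to a target description along the following lines. This is a minimal sketch; the function name coff_target and the wording of the returned strings are hypothetical, and only the octal values are taken from Figure 18-3.

```c
/*
 * Map a file-header magic number (octal values from Figure 18-3)
 * to a short description of the target.  Returns "unknown" for any
 * value not listed in the figure.
 */
const char *coff_target(unsigned short f_magic)
{
    switch (f_magic) {
    case 0550: return "3B20S";
    case 0560: return "3B2 or 3B5";
    case 0570: return "VAX, writable text";
    case 0575: return "VAX, read-only text";
    case 0520: return "Motorola, writable text";
    case 0521: return "Motorola, read-only sharable text";
    case 0522: return "Motorola, demand-paged text";
    case 0530: return "IBM 370, writable text";
    case 0535: return "IBM 370, read-only sharable text";
    default:   return "unknown";
    }
}
```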
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Flags 



The last 2 bytes of the file header are flags that describe the
type of the object file. The currently defined flags are given in
Figure 18-4.

Mnemonic    Flag      Meaning

F_RELFLG    00001     Relocation information stripped from the
                      file
F_EXEC      00002     File is executable (i.e., no unresolved
                      external references)
F_LNNO      00004     Line numbers stripped from the file
F_LSYMS     00010     Local symbols stripped from the file
F_MINMAL    00020     Not used by the UNIX system
F_UPDATE    00040     Not used by the UNIX system
F_SWABD     00100     Not used by the UNIX system
F_AR16WR    00200     File has the byte ordering used by the
                      PDP*-11/70 processor.
F_AR32WR    00400     File has the byte ordering used by the
                      VAX-11/780 (i.e., 32 bits per word, least
                      significant byte first).
F_AR32W     01000     File has the byte ordering used by the
                      UNIX PC and 3B computers (i.e., 32 bits
                      per word, most significant byte first).
F_PATCH     02000     Not used by the UNIX system
F_BM32ID    0160000   WE 32000 processor ID field.

Figure 18-4. File Header Flags



* Trademark of Digital Equipment Corporation 
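
The flags are individual bits, so they are tested with a bitwise AND. The following sketch repeats a few of the values from Figure 18-4 so that it stands alone; the helper names coff_is_executable and coff_byte_order are hypothetical.

```c
/* Flag values from Figure 18-4 (octal). */
#define F_RELFLG 00001
#define F_EXEC   00002
#define F_LNNO   00004
#define F_LSYMS  00010
#define F_AR16WR 00200
#define F_AR32WR 00400
#define F_AR32W  01000

/* Nonzero if the file is marked executable. */
int coff_is_executable(unsigned short f_flags)
{
    return (f_flags & F_EXEC) != 0;
}

/* Describe the byte ordering recorded in the flags. */
const char *coff_byte_order(unsigned short f_flags)
{
    if (f_flags & F_AR32W)
        return "32 bits per word, most significant byte first";
    if (f_flags & F_AR32WR)
        return "32 bits per word, least significant byte first";
    if (f_flags & F_AR16WR)
        return "PDP-11/70 byte ordering";
    return "not recorded";
}
```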






File Header Declaration 

The C structure declaration for the file header is given in 
Figure 18-5. This declaration may be found in the header file 
filehdr.h. 



struct filehdr {
    unsigned short  f_magic;    /* magic number */
    unsigned short  f_nscns;    /* number of sections */
    long            f_timdat;   /* time and date stamp */
    long            f_symptr;   /* file ptr to symbol table */
    long            f_nsyms;    /* number of entries in the symbol table */
    unsigned short  f_opthdr;   /* size of optional header */
    unsigned short  f_flags;    /* flags */
};

#define FILHDR struct filehdr
#define FILHSZ sizeof(FILHDR)



Figure 18-5. File Header Declaration 
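
When an object file is examined on a machine whose byte order or structure padding differs from the target, the header cannot simply be read into the structure. The sketch below decodes the 20 bytes field by field using the offsets from Figure 18-2. It assumes the most-significant-byte-first ordering that Figure 18-4 attributes to the UNIX PC and 3B computers; the names coff_filehdr, be16, be32, and decode_filehdr are hypothetical.

```c
/* Host-side copy of the file header fields (hypothetical type name). */
struct coff_filehdr {
    unsigned short f_magic;
    unsigned short f_nscns;
    long           f_timdat;
    long           f_symptr;
    long           f_nsyms;
    unsigned short f_opthdr;
    unsigned short f_flags;
};

/* Read a 2-byte value stored most significant byte first. */
static unsigned long be16(const unsigned char *p)
{
    return ((unsigned long)p[0] << 8) | (unsigned long)p[1];
}

/* Read a 4-byte value stored most significant byte first. */
static unsigned long be32(const unsigned char *p)
{
    return (be16(p) << 16) | be16(p + 2);
}

/* Decode the 20 raw header bytes at `raw` into `h`. */
void decode_filehdr(const unsigned char *raw, struct coff_filehdr *h)
{
    h->f_magic  = (unsigned short)be16(raw);        /* bytes 0-1   */
    h->f_nscns  = (unsigned short)be16(raw + 2);    /* bytes 2-3   */
    h->f_timdat = (long)be32(raw + 4);              /* bytes 4-7   */
    h->f_symptr = (long)be32(raw + 8);              /* bytes 8-11  */
    h->f_nsyms  = (long)be32(raw + 12);             /* bytes 12-15 */
    h->f_opthdr = (unsigned short)be16(raw + 16);   /* bytes 16-17 */
    h->f_flags  = (unsigned short)be16(raw + 18);   /* bytes 18-19 */
}
```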



OPTIONAL HEADER INFORMATION 

The template for optional information varies among different
systems that use the COFF. Applications place all system-dependent
information into this record. This allows each operating system
access to information that only it uses, without forcing all
COFF files to reserve space for that information. General
utility programs (for example, the symbol table access library
functions, the disassembler, etc.) are made to work properly on
any common object file. This is done by seeking past this record
using the size of the optional header information given in the
file header field f_opthdr.



Standard UNIX System a.out Header 

By default, files produced by the link editor for a UNIX system
always have a standard UNIX system a.out header in the
optional header field. The UNIX system a.out header is 28
bytes. The fields of the optional header are described in Figures
18-6 and 18-7.

Bytes   Declaration   Name         Description

0-1     short         magic        Magic number
2-3     short         vstamp       Version stamp
4-7     long int      tsize        Size of text in bytes
8-11    long int      dsize        Size of initialized data in bytes
12-15   long int      bsize        Size of uninitialized data in bytes
16-19   long int      dum1         Unused dummy field
20-23   long int      dum2         Unused dummy field
24-27   long int      entry        Entry point
28-31   long int      text_start   Base address of text
32-35   long int      data_start   Base address of data

Figure 18-6. Optional Header Contents (3B20S Computers Only)

Bytes   Declaration   Name         Description

0-1     short         magic        Magic number
2-3     short         vstamp       Version stamp
4-7     long int      tsize        Size of text in bytes
8-11    long int      dsize        Size of initialized data in bytes
12-15   long int      bsize        Size of uninitialized data in bytes
16-19   long int      entry        Entry point
20-23   long int      text_start   Base address of text
24-27   long int      data_start   Base address of data

Figure 18-7. Optional Header Contents (UNIX PC and
Processors other than the 3B20S)



The magic number in the optional header supplies operating 
system dependent information about the object file, whereas 
the magic number in the file header specifies the machine on 
which the object file runs. The magic number in the optional 
header supplies information telling the operating system on 
that machine how that file should be executed. 

The magic numbers recognized by the UNIX operating system
are given in Figure 18-8.

Value   Meaning

0407    The text segment is not write-protected or sharable;
        the data segment is contiguous with the text segment.
0410    The data segment starts at the next segment following
        the text segment and the text segment is write
        protected.
0413    The data segment starts at a certain boundary within
        the next segment following the text segment. The text
        segment is shared, demand paged, and write protected.

Figure 18-8. UNIX System Magic Numbers
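
A loader-style check on the optional header magic number might look like the following sketch. The macro and function names are hypothetical; only the octal values and their meanings come from Figure 18-8.

```c
/* Optional-header magic numbers from Figure 18-8 (hypothetical names). */
#define AOUT_MAGIC_PLAIN 0407   /* text not write-protected or sharable */
#define AOUT_MAGIC_RO    0410   /* text write protected */
#define AOUT_MAGIC_PAGED 0413   /* text shared, demand paged, write protected */

/* Nonzero if the text segment is write protected. */
int aout_text_write_protected(short magic)
{
    return magic == AOUT_MAGIC_RO || magic == AOUT_MAGIC_PAGED;
}

/* Nonzero if the text segment is demand paged. */
int aout_text_demand_paged(short magic)
{
    return magic == AOUT_MAGIC_PAGED;
}
```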



UNIX PC Shared Library 

Programs which use the UNIX PC shared library (see
shlib(4)) have a magic number of 0413. They are identified as
shared library programs NOT by the magic number but by
having an extra section (.lib) linked into the program. This extra
section is the result of invoking the ld(1) command as described
in the shlib(4) manual page. In addition, the UNIX size(1)
command will report the presence of this extra section.



Optional Header Declaration 

The C language structure declaration currently used for the 
UNIX system a.out file header is given in Figure 18-9. This 
declaration may be found in the header file aouthdr.h. 



typedef struct aouthdr {
    short   magic;        /* magic number */
    short   vstamp;       /* version stamp */
    long    tsize;        /* text size in bytes, padded */
                          /* to full word boundary */
    long    dsize;        /* initialized data size */
    long    bsize;        /* uninitialized data size */
    long    entry;        /* entry point */
    long    text_start;   /* base of text for this file */
    long    data_start;   /* base of data for this file */
} AOUTHDR;



Figure 18-9. Aouthdr Declaration 



SECTION HEADERS 

Every object file has a table of section headers to specify the 
layout of data within the file. The section header table consists 
of one entry for every section in the file. The information in 
the section header is described in Figure 18-10.

Bytes   Declaration      Name        Description

0-7     char             s_name      8-char null padded section name
8-11    long int         s_paddr     Physical address of section
12-15   long int         s_vaddr     Virtual address of section
16-19   long int         s_size      Section size in bytes
20-23   long int         s_scnptr    File pointer to raw data
24-27   long int         s_relptr    File ptr to relocation entries
28-31   long int         s_lnnoptr   File ptr to line number entries
32-33   unsigned short   s_nreloc    Number of relocation entries
34-35   unsigned short   s_nlnno     Number of line number entries
36-39   long int         s_flags     Flags (see Figure 18-11)

Figure 18-10. Section Header Contents



The size of a section is padded to a multiple of 4 bytes.
File pointers are byte offsets that can be used to locate the
start of data, relocation, or line number entries for the section.
They can be readily used with the UNIX system function
fseek(3S).
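
As an illustration of both points, the sketch below rounds a size up to a multiple of 4 and uses fseek to position a stream at a section's raw data before reading it. The function names pad_to_4 and read_section_data are hypothetical, and treating a zero file pointer as "no raw data in the file" is an assumption made for this example.

```c
#include <stdio.h>

/* Round a byte count up to the next multiple of 4. */
long pad_to_4(long size)
{
    return (size + 3L) & ~3L;
}

/*
 * Read up to `buflen` bytes of a section's raw data into `buf`.
 * `s_scnptr` and `s_size` are taken from the section header.
 * Returns the number of bytes read, 0 if the section has no raw
 * data in the file, or -1 if the seek fails.
 */
long read_section_data(FILE *fp, long s_scnptr, long s_size,
                       unsigned char *buf, long buflen)
{
    long want = (s_size < buflen) ? s_size : buflen;

    if (s_scnptr == 0L || want <= 0L)
        return 0L;                      /* nothing stored in the file */
    if (fseek(fp, s_scnptr, SEEK_SET) != 0)
        return -1L;                     /* could not position the stream */
    return (long)fread(buf, 1, (size_t)want, fp);
}
```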



Flags 

The lower 4 bits of the flag field indicate a section type. The 
flags are described in Figure 18-11. 



Mnemonic      Flag   Meaning

STYP_REG      0x00   Regular section (allocated, relocated,
                     loaded)
STYP_DSECT    0x01   Dummy section (not allocated, relocated,
                     not loaded)
STYP_NOLOAD   0x02   Noload section (allocated, relocated, not
                     loaded)
STYP_GROUP    0x04   Grouped section (formed from input
                     sections)
STYP_PAD      0x08   Padding section (not allocated, not
                     relocated, loaded)
STYP_COPY     0x10   Copy section (for a decision function used
                     in updating fields; not allocated, not
                     relocated, loaded, relocation and line
                     number entries processed normally)

Figure 18-11. Section Header Flags
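
The "loaded" column of Figure 18-11 can be expressed as a small lookup. This sketch assumes the flag field holds exactly one of the listed values; the function name section_is_loaded is hypothetical, and it returns -1 for STYP_GROUP and for any other value, because the figure does not say whether those are loaded.

```c
/* Section type values from Figure 18-11. */
#define STYP_REG    0x00
#define STYP_DSECT  0x01
#define STYP_NOLOAD 0x02
#define STYP_GROUP  0x04
#define STYP_PAD    0x08
#define STYP_COPY   0x10

/*
 * Returns 1 if the figure lists the section type as loaded,
 * 0 if it lists the type as not loaded, and -1 if the figure
 * does not say (STYP_GROUP or an unlisted value).
 */
int section_is_loaded(long s_flags)
{
    switch (s_flags) {
    case STYP_REG:
    case STYP_PAD:
    case STYP_COPY:
        return 1;
    case STYP_DSECT:
    case STYP_NOLOAD:
        return 0;
    default:
        return -1;
    }
}
```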



Section Header Declaration 

The C structure declaration for the section headers is described 
in Figure 18-12. This declaration may be found in the header 
file scnhdr.h.

struct scnhdr {
    char            s_name[8];   /* section name */
    long            s_paddr;     /* physical address */
    long            s_vaddr;     /* virtual address */
    long            s_size;      /* section size */
    long            s_scnptr;    /* file ptr to */
                                 /* section raw data */
    long            s_relptr;    /* file ptr to relocation */
    long            s_lnnoptr;   /* file ptr to line number */
    unsigned short  s_nreloc;    /* number of relocation */
                                 /* entries */
    unsigned short  s_nlnno;     /* number of line number */
                                 /* entries */
    long            s_flags;     /* flags */
};

#define SCNHDR struct scnhdr
#define SCNHSZ sizeof(SCNHDR)



Figure 18-12. Section Header Declaration 



.bss Section Header 

The one deviation from the normal rule in the section header 
table is the entry for uninitialized data in a .bss section. A 
.bss section has a size and symbols that refer to it, and 
symbols that are defined in it. At the same time, a .bss 
section has no relocation entries, no line number entries, and no 
data. Therefore, a .bss section has an entry in the section 
header table but occupies no space elsewhere in the file. In this 
case, the number of relocation and line number entries, as well 
as all file pointers in a .bss section header, are 0. 
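
A consistency check based on this rule might be sketched as follows. The structure coff_scnhdr is a hypothetical host-side copy of the section header fields, and bss_header_ok is a hypothetical helper; neither is part of scnhdr.h.

```c
#include <string.h>

/* Host-side copy of the section header fields (hypothetical type name). */
struct coff_scnhdr {
    char           s_name[8];
    long           s_paddr;
    long           s_vaddr;
    long           s_size;
    long           s_scnptr;
    long           s_relptr;
    long           s_lnnoptr;
    unsigned short s_nreloc;
    unsigned short s_nlnno;
    long           s_flags;
};

/*
 * Nonzero if `s` is named .bss and follows the rule above: no raw
 * data, no relocation entries, and no line number entries, so every
 * file pointer and both entry counts are 0.
 */
int bss_header_ok(const struct coff_scnhdr *s)
{
    if (strncmp(s->s_name, ".bss", sizeof s->s_name) != 0)
        return 0;
    return s->s_scnptr == 0L && s->s_relptr == 0L &&
           s->s_lnnoptr == 0L && s->s_nreloc == 0 && s->s_nlnno == 0;
}
```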
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SECTIONS 

Figure 18-1 shows that section headers are followed by the 
appropriate number of bytes of text or data. The raw data for 
each section begins on a full word boundary in the file. 

Files produced by cc and as always contain three
sections, called .text, .data, and .bss. The .text section
contains the instruction text (i.e., executable code), the .data
section contains initialized data variables, and the .bss section
contains uninitialized data variables.

The link editor SECTIONS directive (see Chapter 17) allows
users to

• Describe how input sections are to be combined.

• Direct the placement of output sections.

• Rename output sections.

If no SECTIONS directives are given, each input section
appears in an output section of the same name. For example, if
a number of object files from cc are linked together
(each containing the three sections .text, .data, and .bss), the
output object file contains three sections, .text, .data, and
.bss.



RELOCATION INFORMATION 

Object files have one relocation entry for each relocatable 
reference in the text or data. The relocation information 
consists of entries with the format described in Figure 18-13.

Bytes   Declaration      Name       Description

0-3     long int         r_vaddr    (Virtual) address of reference
4-7     long int         r_symndx   Symbol table index
8-9     unsigned short   r_type     Relocation type

Figure 18-13. Relocation Section Contents



The first 4 bytes of the entry are the virtual address of the text 
or data to which this entry applies. The next field is the index, 
counted from 0, of the symbol table entry that is being 
referenced. The type field indicates the type of relocation to be 
applied. 

As the link editor reads each input section and performs 
relocation, the relocation entries are read. They direct how 
references found within the input section are treated. 

The currently recognized relocation types are given in Figures 
18-14 through 18-16.



Mnemonic   Flag   Meaning
--------   ----   -------
R_ABS      0      Reference is absolute; no relocation is
                  necessary. The entry will be ignored.
R_DIR24    04     Direct 24-bit reference to the symbol's
                  virtual address.
R_REL24    05     A "PC-relative" 24-bit reference to the
                  symbol's virtual address. Actual address
                  is calculated by adding a constant to the
                  PC value.



Figure 18-14. UNIX PC and 3B20S Computers 
Relocation Types



Mnemonic   Flag   Meaning
--------   ----   -------
R_ABS      0      Reference is absolute; no relocation is
                  necessary. The entry will be ignored.
R_DIR32    06     Direct 32-bit reference to the symbol's
                  virtual address.
R_DIR32S   012    Direct 32-bit reference to the symbol's
                  virtual address, with the 32-bit value
                  stored in the reverse order in the object
                  file.



Figure 18-15. 3B5 and 3B2 Relocation Types



Mnemonic    Flag   Meaning
--------    ----   -------
R_ABS       0      Reference is absolute; no relocation is
                   necessary. The entry will be ignored.
R_RELBYTE   017    Direct 8-bit reference to the symbol's
                   virtual address.
R_RELWORD   020    Direct 16-bit reference to the symbol's
                   virtual address.
R_RELLONG   021    Direct 32-bit reference to the symbol's
                   virtual address.
R_PCRBYTE   022    A "PC-relative" 8-bit reference to the
                   symbol's virtual address.
R_PCRWORD   023    A "PC-relative" 16-bit reference to the
                   symbol's virtual address.
R_PCRLONG   024    A "PC-relative" 32-bit reference to the
                   symbol's virtual address.



Figure 18-16. UNIX PC VAX Relocation Types



On the VAX processors, relocation of a symbol index of -1
indicates that the amount by which the section is being
relocated is added to the relocatable address.

The assembler (as) automatically generates relocation entries,
which are then used by the link editor. The link editor uses this
information to resolve external references in the file.



Relocation Entry Declaration 

The structure declaration for relocation entries is given in 
Figure 18-17. This declaration may be found in the header file 
reloc.h. 



struct reloc {
        long            r_vaddr;   /* virtual address of reference */
        long            r_symndx;  /* index into symbol table */
        unsigned short  r_type;    /* relocation type */
};

#define RELOC   struct reloc
#define RELSZ   10



Figure 18-17. Relocation Entry Declaration 
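
The 10-byte layout of Figure 18-13 can be decoded one field at a time, which avoids depending on how a particular host compiler pads struct reloc (on some hosts sizeof(struct reloc) may be larger than RELSZ). The following is only a sketch: the names reloc_info, get_be, and decode_reloc are hypothetical and are not part of reloc.h, and big-endian byte order is assumed for the file.

```c
#include <stdint.h>

#define RELSZ 10   /* on-disk size of one relocation entry */

/* Hypothetical host-side copy of one relocation entry. */
struct reloc_info {
    long           r_vaddr;   /* virtual address of reference */
    long           r_symndx;  /* index into symbol table */
    unsigned short r_type;    /* relocation type */
};

/* Read an n-byte big-endian unsigned value (n <= 4). */
static uint32_t get_be(const unsigned char *p, int n)
{
    uint32_t v = 0;
    int i;

    for (i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode the RELSZ bytes at buf into *out, field by field. */
void decode_reloc(const unsigned char *buf, struct reloc_info *out)
{
    out->r_vaddr  = (long)(int32_t)get_be(buf, 4);       /* bytes 0-3 */
    out->r_symndx = (long)(int32_t)get_be(buf + 4, 4);   /* bytes 4-7 */
    out->r_type   = (unsigned short)get_be(buf + 8, 2);  /* bytes 8-9 */
}
```

A caller would read RELSZ bytes per entry from the relocation area of a section and pass each 10-byte slice to decode_reloc.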
LINE NUMBERS

When invoked with the -g option, the UNIX system compilers (cc,
f77) generate an entry in the object file for every C language
source line where a breakpoint can be inserted. You can then
reference line numbers when using a software debugger like
sdb. All line numbers in a section are grouped by function as
shown in Figure 18-18.



        symbol index        0
        physical address    line number
        physical address    line number

        symbol index        0
        physical address    line number
        physical address    line number



Figure 18-18. Line Number Grouping 



The first entry in a function grouping has line number 0 and
has, in place of the physical address, an index into the symbol
table for the entry containing the function name. Subsequent
entries have actual line numbers and addresses of the text
corresponding to the line numbers. The line number entries
appear in increasing order of address.



Line Number Declaration



The structure declaration currently used for line number 
entries is given in Figure 18-19. 



struct lineno {
        union {
                long    l_symndx;  /* symtbl index of func name */
                long    l_paddr;   /* paddr of line number */
        } l_addr;
        unsigned short  l_lnno;    /* line number */
};

#define LINENO  struct lineno
#define LINESZ  6



Figure 18-19. Line Number Entry Declaration 
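
The grouping rule (a line number of 0 marks the start of a function group; later entries carry addresses in increasing order) suggests a simple lookup from a text address to its source line. The sketch below assumes the entries have already been read into an in-memory array; lookup_line is a hypothetical helper, not one of the library access routines.

```c
#include <stddef.h>

/* Host-side copy of a line number entry, following Figure 18-19. */
struct lineno {
    union {
        long l_symndx;   /* symbol table index of function name */
        long l_paddr;    /* physical address of line number */
    } l_addr;
    unsigned short l_lnno;   /* line number; 0 starts a function group */
};

/*
 * Find the last line number entry whose address is <= addr.
 * On success, store the owning function's symbol index and the
 * line number, and return 1. Return 0 if addr precedes all entries.
 */
int lookup_line(const struct lineno *tab, size_t n, long addr,
                long *fn_symndx, unsigned short *line)
{
    long cur_fn = -1;
    int found = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tab[i].l_lnno == 0) {
            /* Group header: remember which function follows. */
            cur_fn = tab[i].l_addr.l_symndx;
        } else if (tab[i].l_addr.l_paddr <= addr) {
            *fn_symndx = cur_fn;
            *line = tab[i].l_lnno;
            found = 1;
        } else {
            break;   /* addresses only increase from here on */
        }
    }
    return found;
}
```

The early exit relies on the statement that entries appear in increasing order of address.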



SYMBOL TABLE 

Because of symbolic debugging requirements, the order of 
symbols in the symbol table is very important. Symbols appear 
in the sequence shown in Figure 18-20.



        file name 1
            function 1
                local symbols for function 1
            function 2
                local symbols for function 2
            statics
        file name 2
            function 1
                local symbols for function 1
            statics
        defined global symbols
        undefined global symbols



Figure 18-20. COFF Global Symbol Table 



The word "statics" in Figure 18-20 means symbols defined in 
the C language storage class static outside any function. The 
symbol table consists of at least one fixed-length entry per 
symbol with some symbols followed by auxiliary entries of the 
same size. The entry for each symbol is a structure that holds 
the value, the type, and other information.



Special Symbols

The symbol table contains some special symbols that are
generated by cc, as, and other tools. These symbols are
given in Figure 18-21.



Symbol     Meaning
------     -------
.file      file name
.text      address of .text section
.data      address of .data section
.bss       address of .bss section
.bb        address of start of inner block
.eb        address of end of inner block
.bf        address of start of function
.ef        address of end of function
.target    pointer to the structure or union returned by
           a function
.xfake     dummy tag name for structure, union, or
           enumeration



Figure 18-21. Special Symbols in the Symbol Table 
(Sheet 1 of 2)



Symbol          Meaning
------          -------
.eos            end of members of structure, union, or
                enumeration
_etext,etext    next available address after the end of
                the output section .text
_edata,edata    next available address after the end of
                the output section .data
_end,end        next available address after the end of
                the output section .bss



Figure 18-21. Special Symbols in the Symbol Table 
(Sheet 2 of 2) 



Six of these special symbols occur in pairs. The .bb and .eb
symbols indicate the boundaries of inner blocks. A .bf and .ef
pair brackets each function, and a .xfake and .eos pair names
and defines the limit of structures, unions, and enumerations
that were not named. The .eos symbol also appears after
named structures, unions, and enumerations.

When a structure, union, or enumeration has no tag name, the
compiler invents a name to be used in the symbol table. The name
chosen for the symbol table is .xfake, where "x" is an integer.
If there are three unnamed structures, unions, or enumerations
in the source, their tag names are ".0fake", ".1fake", and
".2fake".

Each of the special symbols has different information stored in 
the symbol table entry as well as the auxiliary entry. 
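
The naming rule for unnamed structures, unions, and enumerations is simple enough to state as code. This is only an illustration of the pattern described above; fake_tag_name is a hypothetical helper and is not taken from the compiler.

```c
#include <stdio.h>

/*
 * Write the invented tag name for the n-th unnamed structure,
 * union, or enumeration (counting from 0) into buf.
 * Returns the number of characters the name requires.
 */
int fake_tag_name(char *buf, size_t size, int n)
{
    return snprintf(buf, size, ".%dfake", n);
}
```

Calling it with n = 0, 1, and 2 produces the three names listed in the text.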
Inner Blocks

The C language defines a block as a compound statement that
begins and ends with braces ( { and } ). An inner block is a
block that occurs within a function (which is also a block).

For each inner block that has local symbols defined, a special
symbol .bb is put in the symbol table immediately before the
first local symbol of that block. Also, a special symbol .eb is
put in the symbol table immediately after the last local symbol
of that block. The sequence is shown in Figure 18-22.



.bb 



local symbols 
for that block 



.eb 



Figure 18-22. Special Symbols (.bb and .eb) 



Because inner blocks can be nested by several levels, the .bb- 
.eb pairs and associated symbols may also be nested. See 
Figure 18-23.



{                               /* block 1 */
        int i;
        char c;
        {                       /* block 2 */
                long a;
                {               /* block 3 */
                        int x;
                }               /* block 3 */
        }                       /* block 2 */
        {                       /* block 4 */
                long i;
        }                       /* block 4 */
}                               /* block 1 */



Figure 18-23. Nested blocks 



The symbol table would look like Figure 18-24.



        .bb for block 1
            i
            c
        .bb for block 2
            a
        .bb for block 3
            x
        .eb for block 3
        .eb for block 2
        .bb for block 4
            i
        .eb for block 4
        .eb for block 1



Figure 18-24. Example of the Symbol Table 
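
Because .bb and .eb entries nest like parentheses, a reader of the symbol table can track the current block depth with a counter. The sketch below works on a plain list of symbol names rather than real symbol table entries; max_block_depth is a hypothetical name used only for illustration.

```c
#include <string.h>

/*
 * Scan a sequence of symbol names and return the deepest level of
 * .bb/.eb nesting, or -1 if the pairs do not balance.
 */
int max_block_depth(const char *const names[], int n)
{
    int depth = 0;
    int max = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (strcmp(names[i], ".bb") == 0) {
            if (++depth > max)
                max = depth;
        } else if (strcmp(names[i], ".eb") == 0) {
            if (--depth < 0)
                return -1;   /* .eb without a matching .bb */
        }
    }
    return depth == 0 ? max : -1;
}
```

Applied to the sequence in Figure 18-24, the deepest level is reached at the .bb for block 3.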



Symbols and Functions 

For each function, a special symbol .bf is put between the 
function name and the first local symbol of the function in the 
symbol table. Also, a special symbol .ef is put immediately 
after the last local symbol of the function in the symbol table. 
The sequence is shown in Figure 18-25.



        function name

        .bf

        local symbols

        .ef



Figure 18-25. Symbols for Functions 



If the return value of the function is a structure or union, a 
special symbol .target is put between the function name and 
the .bf. The sequence is shown in Figure 18-26. 



function name 



.target 



.bf 



local symbols 



.ef 



Figure 18-26. Special Symbol .target



The compiler invents .target to store the function-return
structure or union. The symbol .target is an automatic variable
with "pointer" type. Its value field in the symbol table is
always 0.



Symbol Table Entries

All symbols, regardless of storage class and type, have the same
format for their entries in the symbol table. Each symbol table
entry contains 18 bytes of information. The meaning
of each of the fields in the symbol table entry is described in
Figure 18-27.

It should be noted that indices for symbol table entries begin at
0 and count upward. Each auxiliary entry also counts as one
symbol.



Bytes   Declaration        Name       Description
-----   -----------        ----       -----------
0-7     (see text below)   _n         These 8 bytes contain either
                                      the name of a symbol or a
                                      pointer to the name (see
                                      "Symbol Names").
8-11    long int           n_value    Symbol value; storage class
                                      dependent
12-13   short              n_scnum    Section number of symbol
14-15   unsigned short     n_type     Basic and derived type
                                      specification
16      char               n_sclass   Storage class of symbol
17      char               n_numaux   Number of auxiliary entries



Figure 18-27. Symbol Table Entry Format



Symbol Names 

The first 8 bytes in the symbol table entry are a union of a
character array and two longs. If the symbol name is eight
characters or fewer, the (null-padded) symbol name is stored
there. If the symbol name is longer than eight characters, then
the entire symbol name is stored in the string table. In this
case, the 8 bytes contain two long integers: the first is zero, and
the second is the offset (relative to the beginning of the string
table) of the name in the string table. Since there can be no
symbols with a null name, the zeroes in the first 4 bytes serve
to distinguish a symbol table entry with an offset from one
with a name in the first 8 bytes, as shown in Figure 18-28.



Bytes   Declaration   Name       Description
-----   -----------   ----       -----------
0-7     char          n_name     8-character null-padded symbol
                                 name
0-3     long          n_zeroes   Zero in this field indicates the
                                 name is in the string table
4-7     long          n_offset   Offset of the name in the string
                                 table



Figure 18-28. Name Field 
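
The rule for the name field can be sketched as a small function that returns the symbol name in either case. This is an illustration under stated assumptions, not library code: symbol_name is a hypothetical helper, the 8 bytes are taken raw from the file, byte order is assumed to be big-endian, and strtab is assumed to point at the first byte of the string table (including its 4-byte size field).

```c
#include <string.h>

#define SYMNMLEN 8

/*
 * Return the name of a symbol given the first 8 bytes of its entry.
 * Short names are copied into out (which must hold SYMNMLEN + 1
 * bytes) and null-terminated; long names are returned as a pointer
 * into the string table.
 */
const char *symbol_name(const unsigned char *field, const char *strtab,
                        char *out)
{
    /* n_zeroes == 0 means bytes 4-7 hold a string table offset. */
    if (field[0] == 0 && field[1] == 0 && field[2] == 0 && field[3] == 0) {
        unsigned long off = ((unsigned long)field[4] << 24) |
                            ((unsigned long)field[5] << 16) |
                            ((unsigned long)field[6] << 8)  |
                             (unsigned long)field[7];
        return strtab + off;
    }

    /* Otherwise the name is stored in place and may lack a null. */
    memcpy(out, field, SYMNMLEN);
    out[SYMNMLEN] = '\0';
    return out;
}
```

The explicit terminator matters because a name of exactly eight characters fills the field with no trailing null byte.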



Some special symbols are generated by the compiler and link
editor as discussed in "Special Symbols".



Storage Classes

The storage class field has one of the values described in Figure
18-29. These "defines" may be found in the header file
storclass.h.



Mnemonic     Value   Storage Class
--------     -----   -------------
C_EFCN       -1      physical end of a function
C_NULL       0       -
C_AUTO       1       automatic variable
C_EXT        2       external symbol
C_STAT       3       static
C_REG        4       register variable
C_EXTDEF     5       external definition
C_LABEL      6       label
C_ULABEL     7       undefined label
C_MOS        8       member of structure
C_ARG        9       function argument
C_STRTAG     10      structure tag
C_MOU        11      member of union
C_UNTAG      12      union tag
C_TPDEF      13      type definition
C_USTATIC    14      uninitialized static
C_ENTAG      15      enumeration tag
C_MOE        16      member of enumeration
C_REGPARM    17      register parameter
C_FIELD      18      bit field



Figure 18-29. Storage Classes (Sheet 1 of 2)



Mnemonic     Value   Storage Class
--------     -----   -------------
C_BLOCK      100     beginning and end of block
C_FCN        101     beginning and end of function
C_EOS        102     end of structure
C_FILE       103     file name
C_LINE       104     used only by utility programs
C_ALIAS      105     duplicated tag
C_HIDDEN     106     like static, used to avoid name conflicts



Figure 18-29. Storage Classes (Sheet 2 of 2) 



All of these storage classes except for C_ALIAS and C_HIDDEN
are generated by cc or as. The compress
utility, cprs, generates the C_ALIAS mnemonic. This utility
(described in the UNIX System Reference Manual) removes
duplicated structure, union, and enumeration definitions and
puts ALIAS entries in their places. The storage class
C_HIDDEN is not used by any UNIX system tools.

Some of these storage classes are used only internally by
cc and as. These storage classes are C_EFCN,
C_EXTDEF, C_ULABEL, C_USTATIC, and C_LINE.



Storage Classes for Special Symbols 

Some special symbols are restricted to certain storage classes. 
They are given in Figure 18-30.



Special Symbol   Storage Class
--------------   -------------
.file            C_FILE
.bb              C_BLOCK
.eb              C_BLOCK
.bf              C_FCN
.ef              C_FCN
.target          C_AUTO
.xfake           C_STRTAG, C_UNTAG, C_ENTAG
.eos             C_EOS
.text            C_STAT
.data            C_STAT
.bss             C_STAT



Figure 18-30. Storage Class by Special Symbols 



Also, some storage classes are used only for certain special
symbols. They are summarized in Figure 18-31.



Storage Class   Special Symbol
-------------   --------------
C_BLOCK         .bb, .eb
C_FCN           .bf, .ef
C_EOS           .eos
C_FILE          .file



Figure 18-31. Restricted Storage Classes 



Symbol Value Field 

The meaning of the "value" of a symbol depends on its storage 
class. This relationship is summarized in Figure 18-32.



Storage Class   Meaning
-------------   -------
C_AUTO          stack offset in bytes
C_EXT           relocatable address
C_STAT          relocatable address
C_REG           register number
C_LABEL         relocatable address
C_MOS           offset in bytes
C_ARG           stack offset in bytes
C_STRTAG        0
C_MOU           0
C_UNTAG         0
C_TPDEF         0
C_ENTAG         0
C_MOE           enumeration value
C_REGPARM       register number
C_FIELD         bit displacement
C_BLOCK         relocatable address
C_FCN           relocatable address
C_EOS           size
C_FILE          (see text below)
C_ALIAS         tag index
C_HIDDEN        relocatable address



Figure 18-32. Storage Class and Value 



If a symbol has storage class C_FILE, the value of that symbol
equals the symbol table entry index of the next .file symbol.
That is, the .file entries form a one-way linked list in the symbol
table. If there are no more .file entries in the symbol table,
the value of the symbol is the index of the first global symbol.

Relocatable symbols have a value equal to the virtual address
of that symbol. When the section is relocated by the link
editor, the value of these symbols changes.
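
The linked list formed by the .file entries can be followed with a short loop. The sketch below keeps only the two fields it needs in a hypothetical struct sym_min, and it assumes that the first .file entry sits at index 0, as Figure 18-20 suggests; count_files is an illustrative name, not a library routine.

```c
#define C_FILE 103   /* storage class of a .file entry */

/* Hypothetical reduced symbol entry: only the fields used here. */
struct sym_min {
    long n_value;    /* for C_FILE: index of the next .file entry */
    char n_sclass;   /* storage class */
};

/*
 * Count the .file entries by following the n_value links, starting
 * at index 0. The walk ends when a link leads to an entry that is
 * not a .file entry (the first global symbol) or leaves the table.
 */
int count_files(const struct sym_min *tab, long nsyms)
{
    int count = 0;
    long i = 0;

    while (i >= 0 && i < nsyms && tab[i].n_sclass == C_FILE) {
        count++;
        if (tab[i].n_value <= i)
            break;   /* guard against a corrupt backward link */
        i = tab[i].n_value;
    }
    return count;
}
```

The backward-link guard is a defensive addition for malformed input and is not something the format itself requires.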



Section Number Field 

Section numbers are listed in Figure 18-33.



Mnemonic   Section Number   Meaning
--------   --------------   -------
N_DEBUG    -2               Special symbolic debugging symbol
N_ABS      -1               Absolute symbol
N_UNDEF    0                Undefined external symbol
N_SCNUM    1-077777         Section number where symbol was
                            defined



Figure 18-33. Section Number 



A special section number (-2) marks symbolic debugging
symbols, including structure/union/enumeration tag names,
typedefs, and the name of the file. A section number of -1
indicates that the symbol has a value but is not relocatable.
Examples of absolute-valued symbols include automatic and
register variables, function arguments, and .eos symbols. The
.text, .data, and .bss symbols default to section numbers 1, 2,
and 3, respectively.

With one exception, a section number of 0 indicates a
relocatable external symbol that is not defined in the current
file. The one exception is a multiply defined external symbol
(i.e., FORTRAN common or an uninitialized variable defined
external to a function in C). In the symbol table of each file
where the symbol is defined, the section number of the symbol
is 0 and the value of the symbol is a positive number giving the
size of the symbol. When the files are combined, the link editor
combines all the input symbols into one symbol with the section
number of the .bss section. The maximum size of all the input
symbols with the same name is used to allocate space for the
symbol and the value becomes the address of the symbol. This
is the only case where a symbol has a section number of 0 and
a non-zero value.
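
The size rule for multiply defined external symbols amounts to taking a maximum over the input definitions. The following sketch shows only that calculation; struct sym_ref and common_size are hypothetical names, and the code is not a description of how the link editor is actually written.

```c
#include <stddef.h>

#define N_UNDEF 0   /* section number of an undefined external symbol */

/* Hypothetical view of one input definition of the same name. */
struct sym_ref {
    short n_scnum;   /* section number */
    long  n_value;   /* for N_UNDEF with non-zero value: size in bytes */
};

/*
 * Return the space to allocate in .bss for a multiply defined
 * external symbol: the largest size among the input entries that
 * have section number 0 and a positive value. Returns 0 if no
 * entry qualifies.
 */
long common_size(const struct sym_ref *defs, size_t n)
{
    long max = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (defs[i].n_scnum == N_UNDEF && defs[i].n_value > max)
            max = defs[i].n_value;
    return max;
}
```

For three files declaring the same uninitialized variable with sizes 4, 16, and 8, the function returns 16.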



Section Numbers and Storage Classes 

Symbols having certain storage classes are also restricted to 
certain section numbers. They are summarized in Figure 18-34.



Storage Class   Section Number
-------------   --------------
C_AUTO          N_ABS
C_EXT           N_ABS, N_UNDEF, N_SCNUM
C_STAT          N_SCNUM
C_REG           N_ABS
C_LABEL         N_UNDEF, N_SCNUM
C_MOS           N_ABS
C_ARG           N_ABS
C_STRTAG        N_DEBUG
C_MOU           N_ABS
C_UNTAG         N_DEBUG
C_TPDEF         N_DEBUG
C_ENTAG         N_DEBUG
C_MOE           N_ABS
C_REGPARM       N_ABS
C_FIELD         N_ABS
C_BLOCK         N_SCNUM
C_FCN           N_SCNUM
C_EOS           N_ABS
C_FILE          N_DEBUG
C_ALIAS         N_DEBUG



Figure 18-34. Section Number and Storage Class



Type Entry

The type field in the symbol table entry contains information
about the basic and derived type for the symbol. This
information is generated by cc. The VAX compiler generates
this information only if the -g option is used. Each symbol
has exactly one basic or fundamental type but can have more
than one derived type. The format of the 16-bit type entry is



        d6    d5    d4    d3    d2    d1    typ



Bits 0 through 3, called "typ", indicate one of the fundamental
types given in Figure 18-35.



Mnemonic    Value   Type
--------    -----   ----
T_NULL      0       type not assigned
T_CHAR      2       character
T_SHORT     3       short integer
T_INT       4       integer
T_LONG      5       long integer
T_FLOAT     6       floating point
T_DOUBLE    7       double word
T_STRUCT    8       structure
T_UNION     9       union
T_ENUM      10      enumeration
T_MOE       11      member of enumeration
T_UCHAR     12      unsigned character
T_USHORT    13      unsigned short
T_UINT      14      unsigned integer
T_ULONG     15      unsigned long



Figure 18-35. Fundamental Types 



Bits 4 through 15 are arranged as six 2-bit fields marked "d1"
through "d6." These "d" fields represent levels of the derived
types given in Figure 18-36.



Mnemonic   Value   Type
--------   -----   ----
DT_NON     0       no derived type
DT_PTR     1       pointer
DT_FCN     2       function
DT_ARY     3       array



Figure 18-36. Derived Types 



The following examples demonstrate the interpretation of the 
symbol table entry representing type. 

char *func(); 

Here func is the name of a function that returns a pointer to a
character. The fundamental type of func is 2 (character), the
d1 field is 2 (function), and the d2 field is 1 (pointer).
Therefore, the type word in the symbol table for func contains
the hexadecimal number 0x62, which is interpreted to mean
"function that returns a pointer to a character."

short *tabptr[10][25][3];

Here tabptr is a 3-dimensional array of pointers to short
integers. The fundamental type of tabptr is 3 (short integer);
the d1, d2, and d3 fields each contain a 3 (array), and the d4
field is 1 (pointer). Therefore, the type entry in the symbol
table contains the hexadecimal number 0x7f3, indicating a
"3-dimensional array of pointers to short integers."
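
The bit layout can be checked with a few shifts and masks. The sketch below packs a fundamental type and a list of derived types into a 16-bit type word and extracts a single "d" field again; make_type and derived_field are hypothetical helpers written for this illustration, not names from syms.h.

```c
/*
 * Build a type word from a fundamental type (bits 0-3) and up to
 * six derived types, given in order d1, d2, ... (2 bits each,
 * starting at bit 4).
 */
unsigned short make_type(unsigned typ, const unsigned *d, int nd)
{
    unsigned short t = (unsigned short)(typ & 0x0fu);
    int i;

    for (i = 0; i < nd && i < 6; i++)
        t |= (unsigned short)((d[i] & 3u) << (4 + 2 * i));
    return t;
}

/* Extract derived-type field d<level> (level = 1..6) from a type word. */
unsigned derived_field(unsigned short t, int level)
{
    return ((unsigned)t >> (4 + 2 * (level - 1))) & 3u;
}
```

With typ = 2 and derived types {2, 1} the result is 0x62, and with typ = 3 and derived types {3, 3, 3, 1} it is 0x7f3, matching the two examples above.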
Type Entries and Storage Classes

Figure 18-37 shows the type entries that are legal for each
storage class.



                       "d" entry
Storage      ------------------------------   "typ" entry
Class        Function?   Array?   Pointer?    Basic Type
-------      ---------   ------   --------    -----------
C_AUTO       no          yes      yes         Any except T_MOE
C_EXT        yes         yes      yes         Any except T_MOE
C_STAT       yes         yes      yes         Any except T_MOE
C_REG        no          no       yes         Any except T_MOE
C_LABEL      no          no       no          T_NULL
C_MOS        no          yes      yes         Any except T_MOE
C_ARG        yes         no       yes         Any except T_MOE
C_STRTAG     no          no       no          T_STRUCT
C_MOU        no          yes      yes         Any except T_MOE
C_UNTAG      no          no       no          T_UNION



Figure 18-37. Type Entries by Storage Class 
(Sheet 1 of 2)



                       "d" entry
Storage      ------------------------------   "typ" entry
Class        Function?   Array?   Pointer?    Basic Type
-------      ---------   ------   --------    -----------
C_TPDEF      no          yes      yes         Any except T_MOE
C_ENTAG      no          no       no          T_ENUM
C_MOE        no          no       no          T_MOE
C_REGPARM    no          no       yes         Any except T_MOE
C_FIELD      no          no       no          T_ENUM, T_UCHAR,
                                              T_USHORT, T_UINT,
                                              T_ULONG
C_BLOCK      no          no       no          T_NULL
C_FCN        no          no       no          T_NULL
C_EOS        no          no       no          T_NULL
C_FILE       no          no       no          T_NULL
C_ALIAS      no          no       no          T_STRUCT, T_UNION,
                                              T_ENUM



Figure 18-37. Type Entries by Storage Class
(Sheet 2 of 2)



Conditions for the "d" entries apply to d1 through d6, except
that it is impossible to have two consecutive derived types of
"function."

Although function arguments can be declared as arrays, they
are changed to pointers by default. Therefore, no function
argument can have "array" as its first derived type.



Structure for Symbol Table Entries 

The C language structure declaration for the symbol table entry 
is given in Figure 18-38. This declaration may be found in the 
header file syms.h.



struct syment
{
        union
        {
                char    _n_name[SYMNMLEN];  /* symbol name */
                struct
                {
                        long    _n_zeroes;  /* symbol name */
                        long    _n_offset;  /* location in string table */
                } _n_n;
                char    *_n_nptr[2];        /* allows overlaying */
        } _n;
        long            n_value;    /* value of symbol */
        short           n_scnum;    /* section number */
        unsigned short  n_type;     /* type and derived */
        char            n_sclass;   /* storage class */
        char            n_numaux;   /* number of aux entries */
};

#define n_name          _n._n_name
#define n_zeroes        _n._n_n._n_zeroes
#define n_offset        _n._n_n._n_offset
#define n_nptr          _n._n_nptr[1]

#define SYMNMLEN        8
#define SYMESZ          18      /* size of a symbol table entry */



Figure 18-38. Symbol Table Entry Declaration



Auxiliary Table Entries

Currently, there is at most one auxiliary entry per symbol. The 
auxiliary table entry contains the same number of bytes as the 
symbol table entry. However, unlike symbol table entries, the 
format of an auxiliary table entry of a symbol depends on its 
type and storage class. They are summarized in Figure 18-39.



                                 Type Entry
               Storage       -------------------    Auxiliary
Name           Class         d1         typ         Entry Format
----           -------       --         ---         ------------
.file          C_FILE        DT_NON     T_NULL      file name
.text,.data,   C_STAT        DT_NON     T_NULL      section
.bss
tagname        C_STRTAG      DT_NON     T_NULL      tag name
               C_UNTAG
               C_ENTAG
.eos           C_EOS         DT_NON     T_NULL      end of structure
fcname         C_EXT         DT_FCN     (Note 1)    function
               C_STAT
arrname        (Note 2)      DT_ARY     (Note 1)    array
.bb            C_BLOCK       DT_NON     T_NULL      beginning of
                                                    block
.eb            C_BLOCK       DT_NON     T_NULL      end of block
.bf,.ef        C_FCN         DT_NON     T_NULL      beginning and
                                                    end of function
name related   (Note 2)      DT_PTR,    T_STRUCT,   name related
to structure,                DT_ARY,    T_UNION,    to structure,
union,                       DT_NON     T_ENUM      union,
enumeration                                         enumeration

Notes:

1. Any except T_MOE.

2. C_AUTO, C_STAT, C_MOS, C_MOU, C_TPDEF.

Figure 18-39. Auxiliary Symbol Table Entries 



In Figure 18-39, "tagname" means any symbol name including
the special symbol .xfake, and "fcname" and "arrname"
represent any symbol name.

Any symbol that satisfies more than one condition in Figure
18-39 should have a union format in its auxiliary entry.
Symbols that do not satisfy any of the above conditions should
NOT have any auxiliary entry.



File Names 

Each of the auxiliary table entries for a file name contains a
14-character file name in bytes 0 through 13. The remaining
bytes are 0, regardless of the size of the entry.



Sections 

The auxiliary table entries for sections have the format as 
shown in Figure 18-40.



Bytes   Declaration      Name       Description
-----   -----------      ----       -----------
0-3     long int         x_scnlen   section length
4-5     unsigned short   x_nreloc   number of relocation entries
6-7     unsigned short   x_nlinno   number of line numbers
8-17    -                -          unused (filled with zeroes)



Figure 18-40. Format for Auxiliary Table Entries



Tag Names

The auxiliary table entries for tag names have the format 
shown in Figure 18-41.



Bytes   Declaration      Name       Description
-----   -----------      ----       -----------
0-5     -                -          unused (filled with zeroes)
6-7     unsigned short   x_size     size of struct, union, and
                                    enumeration
8-11    -                -          unused (filled with zeroes)
12-15   long int         x_endndx   index of next entry beyond
                                    this structure, union, or
                                    enumeration
16-17   -                -          unused (filled with zeroes)



Figure 18-41. Tag Names Table Entries 



End of Structures 

The auxiliary table entries for the end of structures have the 
format shown in Figure 18-42:



Bytes   Declaration      Name       Description
-----   -----------      ----       -----------
0-3     long int         x_tagndx   tag index
4-5     -                -          unused (filled with zeroes)
6-7     unsigned short   x_size     size of struct, union, or
                                    enumeration
8-17    -                -          unused (filled with zeroes)



Figure 18-42. Table Entries for End of Structures 



Functions 

The auxiliary table entries for functions have the format shown 
in Figure 18-43:



Bytes   Declaration      Name        Description
-----   -----------      ----        -----------
0-3     long int         x_tagndx    tag index
4-7     long int         x_fsize     size of function (in bytes)
8-11    long int         x_lnnoptr   file pointer to line number
12-15   long int         x_endndx    index of next entry beyond
                                     this point
16-17   unsigned short   x_tvndx     index of the function's
                                     address in the transfer
                                     vector table (not used in
                                     UNIX system)



Figure 18-43. Table Entries for Functions 



Arrays 

The auxiliary table entries for arrays have the format shown in
Figure 18-44:



Bytes   Declaration      Name         Description
-----   -----------      ----         -----------
0-3     long int         x_tagndx     tag index
4-5     unsigned short   x_lnno       line number of declaration
6-7     unsigned short   x_size       size of array
8-9     unsigned short   x_dimen[0]   first dimension
10-11   unsigned short   x_dimen[1]   second dimension
12-13   unsigned short   x_dimen[2]   third dimension
14-15   unsigned short   x_dimen[3]   fourth dimension
16-17   -                -            unused (filled with zeroes)



Figure 18-44. Table Entries for Arrays 



End of Blocks and Functions 

The auxiliary table entries for the end of blocks and functions 
have the format shown in Figure 18-45:



Bytes   Declaration      Name     Description
-----   -----------      ----     -----------
0-3     -                -        unused (filled with zeroes)
4-5     unsigned short   x_lnno   C-source line number
6-17    -                -        unused (filled with zeroes)



Figure 18-45. End of Block and Function Entries 



Beginning of Blocks and Functions 

The auxiliary table entries for the beginning of blocks and 
functions have the format shown in Figure 18-46:



Bytes   Declaration      Name       Description
-----   -----------      ----       -----------
0-3     -                -          unused (filled with zeroes)
4-5     unsigned short   x_lnno     C-source line number
6-11    -                -          unused (filled with zeroes)
12-15   long int         x_endndx   index of next entry past
                                    this block
16-17   -                -          unused (filled with zeroes)



Figure 18-46. Format for Beginning of Block and Function



Names Related to Structures, Unions, and Enumerations 

The auxiliary table entries for structure, union, and
enumeration symbols have the format shown in Figure 18-47:



Bytes   Declaration      Name       Description
-----   -----------      ----       -----------
0-3     long int         x_tagndx   tag index
4-5     -                -          unused (filled with zeroes)
6-7     unsigned short   x_size     size of the structure, union,
                                    or enumeration
8-17    -                -          unused (filled with zeroes)



Figure 18-47. Entries for Structures, Unions, and
Enumerations



Names defined by "typedef" may or may not have auxiliary
table entries. For example,

typedef struct people STUDENT;

struct people {
        char name[20];
        long id;
};

typedef struct people EMPLOYEE;

The symbol "EMPLOYEE" has an auxiliary table entry in the
symbol table but the symbol "STUDENT" does not.



Auxiliary Entry Declaration



The C language structure declaration for an auxiliary symbol 
table entry is given in Figure 18-48. This declaration may be 
found in the header file syms.h. 



union auxent
{
        struct
        {
                long    x_tagndx;
                union
                {
                        struct
                        {
                                unsigned short  x_lnno;
                                unsigned short  x_size;
                        } x_lnsz;
                        long    x_fsize;
                } x_misc;
                union
                {
                        struct
                        {
                                long    x_lnnoptr;
                                long    x_endndx;
                        } x_fcn;
                        struct
                        {
                                unsigned short  x_dimen[DIMNUM];
                        } x_ary;
                } x_fcnary;
                unsigned short  x_tvndx;
        } x_sym;
        struct
        {
                char    x_fname[FILNMLEN];
        } x_file;
        struct
        {
                long            x_scnlen;
                unsigned short  x_nreloc;
                unsigned short  x_nlinno;
        } x_scn;
        struct
        {
                long            x_tvfill;
                unsigned short  x_tvlen;
                unsigned short  x_tvran[2];
        } x_tv;
};

#define FILNMLEN        14
#define DIMNUM          4
#define AUXENT          union auxent
#define AUXESZ          18

Figure 18-48. Auxiliary Symbol Table Entry



STRING TABLE

Symbol table names longer than eight characters are stored 
contiguously in the string table with each symbol name 
delimited by a null byte. The first four bytes of the string 
table are the size of the string table in bytes; offsets into the 
string table therefore are greater than or equal to 4. 

For example, given a file containing two symbols (with names
longer than eight characters, long_name_1 and another_one)
the string table has the format as shown in Figure 18-49:



                    28
        'l'    'o'    'n'    'g'
        '_'    'n'    'a'    'm'
        'e'    '_'    '1'    '\0'
        'a'    'n'    'o'    't'
        'h'    'e'    'r'    '_'
        'o'    'n'    'e'    '\0'



Figure 18-49. String Table 



The index of long_name_1 in the string table is 4 and the index
of another_one is 16.
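
The string table layout can be checked with a short sketch. It is
written in Python only for brevity and is not part of the object
file access routines. The names build_string_table and name_at are
hypothetical, and big-endian byte order for the 4-byte size field
is an assumption; the real order is that of the target machine.

```python
import struct

def build_string_table(names):
    """Return (table_bytes, offsets) for a list of long symbol names."""
    body = b""
    offsets = {}
    for name in names:
        offsets[name] = 4 + len(body)      # offsets count from the start of the table
        body += name.encode("ascii") + b"\0"
    size = 4 + len(body)                   # the size field counts its own 4 bytes
    return struct.pack(">l", size) + body, offsets

def name_at(table, offset):
    """Read the null-terminated name stored at the given offset."""
    end = table.index(b"\0", offset)
    return table[offset:end].decode("ascii")

table, offsets = build_string_table(["long_name_1", "another_one"])
print(struct.unpack(">l", table[:4])[0])   # value of the size field
print(offsets)
print(name_at(table, 16))
```

Because the size field occupies the first four bytes, no name can
start at an offset smaller than 4.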



ACCESS ROUTINES 

Supplied with every standard UNIX system release is a set of 
access routines that are used for reading the various parts of a 
common object file. Although the calling program must know 
the detailed structure of the parts of the object file it processes, 
the routines effectively insulate the calling program from the 
knowledge of the overall structure of the object file. In this 
way, you can concern yourself with the section you are
interested in without knowing all the object file details.

The access routines can be divided into four categories:

1. Functions that open or close an object file. 

2. Functions that read header or symbol table information. 

3. Functions that position an object file at the start of a 
particular section of the object file. 

4. A function that returns the symbol table index for a 
particular symbol. 

These routines can be found in the library libld.a and are listed
in Section 3 of the UNIX System V User's Manual. A summary
of what is available can be found in the UNIX System V User's
Manual under LDFCN(4).



Chapter 19

ARBITRARY PRECISION DESK
CALCULATOR LANGUAGE - "bc"



GENERAL 

The arbitrary precision desk calculator language (bc) is a
language and compiler for doing arbitrary precision arithmetic 
under the UNIX operating system. The output of the compiler 
is interpreted and executed by a collection of routines that can 
input, output, and do arithmetic on infinitely large integers and 
on scaled fixed-point numbers. These routines are based on a 
dynamic storage allocator. Overflow does not occur until all 
available core storage is exhausted. 

The bc language has a complete control structure as well as
immediate-mode operation. Functions can be defined and saved 
for later execution. A small collection of library functions is 
also available, including sin, cos, arctan, log, exponential, and 
Bessel functions of integer order. 

The bc compiler was written to make conveniently available a
collection of routines (called dc) that are capable of doing 
arithmetic on integers of arbitrary size. The compiler is not 
intended to provide a complete programming language. It is a 
minimal language facility. 

Some of the uses of this compiler are: 

• Compute with large integers

• Compute accurately to many decimal places 

• Convert numbers from one base to another base.

There is a scaling provision that permits the use of decimal
point notation. Provision is also made for input and output in
bases other than decimal. Numbers can be converted from
decimal to octal by simply setting the output base to
eight.

The actual limit on the number of digits that can be handled
depends on the amount of core storage available. Numbers with
many hundreds of digits can be manipulated even on the smallest
versions of the UNIX operating system.

The syntax of bc is very similar to that of the C language.
This enables users who are familiar with the C language to easily
work with bc.

The simplest kind of statement is an arithmetic expression on a
line by itself. For instance, if you type in the addition of two
numbers (with the + operator) such as

142857 + 285714

the program responds immediately with the sum

428571

The operators -, *, /, %, and ^ can also be used. They indicate
subtraction, multiplication, division, remaindering, and
exponentiation, respectively. Division of integers produces an
integer result truncated toward zero. Division by zero produces an
error comment.
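
Truncation toward zero matters for negative operands. The sketch
below models integer division and remaindering in Python, whose
own // operator rounds downward instead. The names bc_div and
bc_rem are hypothetical and only illustrate the rule; they are
not part of the calculator itself.

```python
def bc_div(a, b):
    """Integer quotient truncated toward zero."""
    if b == 0:
        raise ZeroDivisionError("divide by zero")
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def bc_rem(a, b):
    """Remainder defined as a - (a/b)*b, so it takes the sign of the dividend."""
    return a - bc_div(a, b) * b

print(bc_div(7, 2), bc_div(-7, 2), bc_rem(-7, 2))
```

With these definitions -7/2 is -3 and -7%2 is -1, whereas Python's
built-in operators would give -4 and 1.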

Any term in an expression may be prefixed by a minus sign to 
indicate that it is to be negated (the unary minus sign). The 
expression 

7+-3

is interpreted to mean that -3 is to be added to 7.

More complex expressions with several operators and with
parentheses are interpreted just as in Fortran, with ^ having
the greatest binding power, then *, %, and /, and finally + and -.
Contents of parentheses are evaluated before material outside
the parentheses. Exponentiations are performed from right to
left and the other operators from left to right. The two
expressions

a^b^c and a^(b^c)

are equivalent, as are the two expressions

a*b*c and (a*b)*c.

However, bc shares with Fortran and the C language the
undesirable convention that 

a/b*c is equivalent to (a/b)*c. 
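
The grouping rules can be tried out in Python, whose ** operator
also groups from the right. This is only an illustration: the
helper trunc_div is hypothetical and stands in for integer
division truncated toward zero.

```python
def trunc_div(a, b):
    """Integer quotient truncated toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

# Exponentiation groups from the right: a^b^c means a^(b^c).
assert 2 ** 3 ** 2 == 2 ** (3 ** 2) == 512
assert (2 ** 3) ** 2 == 64

# Division and multiplication group from the left, so with integer
# truncation (a/b)*c and a/(b*c) generally differ.
a, b, c = 7, 2, 3
left_grouped = trunc_div(a, b) * c      # (a/b)*c
right_grouped = trunc_div(a, b * c)     # a/(b*c)
print(left_grouped, right_grouped)
```

The left-grouped form loses the fraction of a/b before the
multiplication, which is why the convention is called undesirable.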

Internal storage registers to hold numbers have single 
lowercase letter names. The value of an expression can be 
assigned to a register in the usual way. The statement 

x = x + 3

has the effect of increasing by three the value of the contents 
of the register named x. When, as in this case, the outermost 
operator is an "=", the assignment is performed; but the result 
is not printed. Only 26 of these named storage registers are 
available. 

There is a built-in square root function whose result is 
truncated to an integer (see the part on "SCALING"). Entering 
the lines

x = sqrt(191)
x

produces the printed result

13
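
As a cross-check, the truncated integer square root can be
reproduced in Python with math.isqrt. The wrapper name
bc_sqrt_scale0 is hypothetical and models only the case in which
no fraction digits are kept.

```python
import math

def bc_sqrt_scale0(n):
    """Square root of a non-negative integer, truncated to an integer."""
    return math.isqrt(n)

# 13*13 = 169 <= 191 < 196 = 14*14
print(bc_sqrt_scale0(191))
```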



BASES 

There are two special internal quantities: ibase (input base)
and obase (output base). The contents of ibase, initially set
to 10 (decimal), determines the base used for interpreting
numbers read in. For example, the input lines

ibase = 8
11

produce the output line

9

and the system is ready to do octal to decimal conversions.
Beware, however, of trying to change the input base back to
decimal by typing

ibase = 10

Because the number 10 is interpreted as octal, this statement
has no effect. For dealing in hexadecimal notation, the
characters A through F are permitted in numbers (regardless
of what base is in effect) and are interpreted as digits having
values 10 through 15, respectively. The statement

ibase = A



changes the base to decimal regardless of what the current 
input base is. Negative and large positive input bases are 
permitted but are useless. No mechanism has been provided for 
the input of arbitrary numbers in bases less than 1 and greater 
than 16. 
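
The behavior of the input base can be modeled with a few lines of
Python. This is a sketch under stated assumptions: read_number is
a hypothetical helper, and it assumes that every character keeps
its face value even when that value is not smaller than the base,
which is what makes the single digit A usable in any base.

```python
DIGITS = "0123456789ABCDEF"

def read_number(text, ibase):
    """Interpret a digit string in the given input base."""
    value = 0
    for ch in text:
        value = value * ibase + DIGITS.index(ch)
    return value

ibase = 10
ibase = read_number("8", ibase)          # ibase = 8
octal_eleven = read_number("11", ibase)  # 11 read as octal
ibase = read_number("10", ibase)         # "ibase = 10" typed in octal
still_octal = ibase
ibase = read_number("A", ibase)          # "ibase = A"
print(octal_eleven, still_octal, ibase)
```

The second assignment leaves the base at 8 because the digits 10
are themselves read in octal; the digit A always has the value 10.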

The content of obase, initially 10 (decimal), is used as the base 
for output numbers. The input lines

obase = 16
1000

produce the output line

3E8 

which is to be interpreted as a 3-digit hexadecimal number. 
Very large output bases are permitted and are sometimes 
useful. For example, large numbers can be output in groups of
five digits by setting obase to 100000. Strange output bases 
(i.e., 1, 0, or negative) are handled appropriately. 
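
Output conversion amounts to repeated division by the output
base. The Python sketch below is illustrative only: to_base and
show are hypothetical names, and printing each digit of a large
base as a zero-padded decimal group separated by blanks is an
assumption about the layout, not a statement of the exact output
format.

```python
def to_base(n, obase):
    """Digits of a non-negative integer in the given base, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, d = divmod(n, obase)
        digits.append(d)
    return digits[::-1]

def show(n, obase):
    """One character per digit up to base 16, decimal groups for larger bases."""
    digits = to_base(n, obase)
    if obase <= 16:
        return "".join("0123456789ABCDEF"[d] for d in digits)
    width = len(str(obase - 1))
    return " ".join(str(d).zfill(width) for d in digits)

print(show(1000, 16))
print(show(12345678901234567890, 100000))
```

With an output base of 100000, each base-100000 digit corresponds
to exactly five decimal digits, which is why the number appears in
groups of five.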

Very large numbers are split across lines with 70 characters per
line. Lines which are continued end with a backslash (\).
Decimal output conversion is practically instantaneous, but 
output of very large numbers (i.e., more than 100 digits) with 
other bases is rather slow. Nondecimal output conversion of a 
100-digit number takes about 3 seconds. 

The ibase and obase quantities have no effect on the course of internal
computation or on the evaluation of expressions. They only
affect input and output conversions, respectively.



SCALING

A third special internal quantity called scale is used to 
determine the scale of calculated quantities. The number of 
digits after the decimal point of a number is referred to as its 
scale. Numbers may have up to 99 decimal digits after the 
decimal point. This fractional part is retained in further 
computations. 

The contents of scale must be no greater than 99 and no less 
than 0. It is initially set to 0. However, appropriate scaling 
can be arranged when more than 99 fraction digits are 
required. 

When two scaled numbers are combined by means of one of the 
arithmetic operations, the result has a scale determined by the 
following rules: 

• Addition and subtraction— The scale of the result is the 
larger of the scales of the two operands. In this case, there 
is never any truncation of the result. 

• Multiplication— The scale of the result is never less than 
the maximum of the two scales of the operands and never 
more than the sum of the scales of the operands. Subject 
to those two restrictions, the scale of the result is set equal 
to the contents of the internal quantity scale. 

• Division— The scale of a quotient is the contents of the 
internal quantity scale. The scale of a remainder is the 
sum of the scales of the quotient and the divisor. 

• Exponentiation— The result of an exponentiation is scaled 
as if the implied multiplications were performed. An 
exponent must be an integer. 

• Square root— The scale of a square root is set to the 
maximum of the scale of the argument and the contents of 
scale.

All of the internal operations are actually carried out in terms
of integers with digits being discarded when necessary. In 
every case where digits are discarded, truncation and not 
rounding is performed. 
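
The scaling rules can be summarized as small functions. The
sketch below is in Python with hypothetical function names. In
each function, a and b are the scales of the operands and scale
is the internal quantity. The formulas for multiplication and
exponentiation are taken from the expression rules given in the
appendix to this chapter.

```python
def scale_add(a, b):
    """Addition and subtraction: the larger operand scale."""
    return max(a, b)

def scale_mul(a, b, scale):
    """Multiplication: min(a+b, max(scale, a, b))."""
    return min(a + b, max(scale, a, b))

def scale_div(scale):
    """Division: the quotient always has the scale of the internal quantity."""
    return scale

def scale_rem(divisor_scale, scale):
    """Remainder: scale of the quotient plus scale of the divisor."""
    return scale + divisor_scale

def scale_pow(a, exponent, scale):
    """Exponentiation: min(a*|exponent|, max(scale, a))."""
    return min(a * abs(exponent), max(scale, a))

def scale_sqrt(a, scale):
    """Square root: the larger of the argument scale and the internal quantity."""
    return max(a, scale)

# 7 * 3.14 with scale = 0 keeps two fraction digits
print(scale_mul(0, 2, 0))
```

For example, with scale set to 0 the product of 7 and 3.14 has
scale 2, so both fraction digits of the result are kept.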

The internal quantities scale, ibase, and obase can be used 
in expressions just like other variables. The input line 

scale = scale + 1 
increases the value of scale by one, and the input line 

scale 

causes the current value of scale to be printed. 

The value of scale retains its meaning as a number of decimal 
digits to be retained in internal computation even when ibase 
or obase are not equal to 10. The internal computations 
(which are still conducted in decimal regardless of the bases) 
are performed to the specified number of decimal digits, never 
hexadecimal, octal, or any other kind of digits. 



FUNCTIONS 

The name of a function is a single lowercase letter. Function 
names are permitted to coincide with simple variable names. 
Twenty-six different defined functions are permitted in 
addition to the 26 variable names. The input line 

define a(x){ 

begins the definition of a function with one argument. This 
line must be followed by one or more statements which make
up the body of the function ending with a right brace ( } ). The
general form of a function is

define a(x) {
        ...
        return
}

Return of control from a function occurs when a return 
statement is executed or when the end of the function is 
reached. The return statement can take either of the two 
forms: 

return 
return(x) 

In the first case, the value of the function is 0; and in the 
second, the value of the function is the expression in 
parentheses. 

Variables used in the function can be declared as automatic by 
a statement of the form 

auto x,y,z 

There can be only one auto statement in a function, and it 
must be the first statement in the definition. These automatic 
variables are allocated space and initialized to zero on entry to 
the function and thrown away on return (exit). The values of 
any variables with the same names outside the function are not 
disturbed. Functions may be called recursively and the 
automatic variables at each level of call are protected. The 
parameters named in a function definition are treated in the 
same way as the automatic variables of that function with the 
single exception that they are given a value on entry to the 
function. An example of a function definition is

define a(x,y){
        auto z
        z = x*y
        return(z)
}

The value of this function a, when called, is the product of its
two arguments, "x" and "y".

A function is called by the appearance of its name followed by 
a string of arguments enclosed in parentheses and separated by 
commas. The result is unpredictable if the wrong number of 
arguments is used. 

Functions with no arguments are defined and called using 
parentheses with nothing between them: (). 

If the function a above has been defined, then the line 

a(7,3.14)

causes the result 21.98 to be printed, and the line

z = a(a(3,4),5)

assigns the value 60 to z. Nothing is printed in this case,
because the outermost operator is an assignment.



SUBSCRIPTED VARIABLES 

A single lowercase letter variable name followed by an 
expression in brackets is called a subscripted variable (an array 
element). The variable name is called the array name, and the 
expression in brackets is called the subscript. Only
1-dimensional arrays are permitted. The names of arrays are
permitted to coincide with the names of simple variables and
function names. Any fractional part of a subscript is discarded
before use. Subscripts must be greater than or equal to 0 and
less than or equal to 2047.

Subscripted variables may be used in expressions, in function 
calls, and in return statements. 

An array name may be used as an argument to a function or 
may be declared as automatic in a function definition by the 
use of empty brackets: 

f(a[]) 

define f(a[]) 
auto a[] 

When an array name is so used, the whole contents of the array 
are copied for the use of the function and thrown away on exit 
from the function. Array names that refer to whole arrays 
cannot be used in any other contexts. 



CONTROL STATEMENTS 

The if, while, and for statements may be used to alter the 
flow within programs or to cause iteration. The range of each 
of them is a statement or a compound statement consisting of a 
collection of statements enclosed in braces. They are written in 
the following way: 

if(relation) statement

while(relation) statement

for(expression1; relation; expression2) statement

or

if(relation) {statements}

while(relation) {statements}

for(expression1; relation; expression2) {statements}

A relation in one of the control statements is an expression of 
the form 



x>y 

where two expressions are related by one of the following six 
relational operators: 

< less than 

> greater than 

<= less than or equal to 

>= greater than or equal to 

== equal to 

!= not equal to



Beware of using "=" instead of "==" as a relational operator. 
Unfortunately, both of these are legal, so there will be no 
diagnostic message, but "=" will not do a comparison. 

The if statement causes execution of its range if and only if the
relation is true. Then control passes to the next statement in 
sequence. 

The while statement causes execution of its range repeatedly 
as long as the relation is true. The relation is tested before 
each execution of its range; and if the relation is false, control 
passes to the next statement beyond the range of the while 
statement. 

The for statement begins by executing expression1. Then the
relation is tested; and if true, the statements in the range of
the for are executed. Then expression2 is executed. The
relation is then tested, etc. The typical use of the for
statement is for a controlled iteration, as in the statement

for(i=1; i<=10; i=i+1) i

which prints the integers from one to ten. The following are 
some examples of the use of the control statements: 

define f(n){
        auto i, x
        x=1
        for(i=1; i<=n; i=i+1) x=x*i
        return(x)
}

The input line

f(a) 

prints "a" factorial if "a" is a positive integer. The following is 
the definition of a function that computes values of the 
binomial coefficient (m and n are assumed to be positive 
integers): 

define b(n,m){
        auto x, j
        x=1
        for(j=1; j<=m; j=j+1) x=x*(n-j+1)/j
        return(x)
}
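
For comparison, the same two loops can be written in Python,
which also has integers of unlimited size. This is only a
cross-check of the algorithms, keeping the function names f and b
from the definitions above. The integer division in b is exact at
every step, because after step j the running value is itself a
binomial coefficient.

```python
import math

def f(n):
    """Factorial by repeated multiplication."""
    x = 1
    for i in range(1, n + 1):
        x = x * i
    return x

def b(n, m):
    """Binomial coefficient: n*(n-1)*...*(n-m+1) / (1*2*...*m)."""
    x = 1
    for j in range(1, m + 1):
        x = x * (n - j + 1) // j    # x*(n-j+1) is always divisible by j here
    return x

print(f(30))
print(b(52, 5))
```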

The following function computes values of the exponential 
function by summing the appropriate series without regard for 
possible truncation errors:

scale = 20
define e(x){
        auto a, b, c, d, n
        a = 1
        b = 1
        c = 1
        d = 0
        n = 1
        while(1==1){
                a = a*x
                b = b*n
                c = c + a/b
                n = n + 1
                if(c==d) return(c)
                d = c
        }
}


ADDITIONAL FEATURES 

There are some additional language features that every user 
should know. 



Normally, statements are typed one to a line. It is also 
permissible, however, to type several statements on a line by 
separating the statements by semicolons. 

If an assignment statement is parenthesized, it then has a 
value; and it can be used anywhere that an expression can. For 
example, the input line 

(x=y+17) 

not only makes the indicated assignment, but also prints the 
resulting value.

The following is an example of a use of the value of an
assignment statement even when it is not parenthesized. The
input line

x = a[i=i+1]

causes a value to be assigned to x and also increments i before 
it is used as a subscript. 

The following constructs work in bc in exactly the same
manner as they do in the C language. Refer to Appendix 7.1 or 
the C language programming documents for more details. 



x=y=z        is the same as        x=(y=z)
x =+ y                             x = x+y
x =- y                             x = x-y
x =* y                             x = x*y
x =/ y                             x = x/y
x =% y                             x = x%y
x =^ y                             x = x^y
x++                                (x=x+1)-1
x--                                (x=x-1)+1
++x                                x = x+1
--x                                x = x-1



Warning: In some of these constructions, spaces are 
significant. There is a real difference between 
x=-y and x= -y. The first replaces x by 
x-y and the second by -y. 



The following are three important things to remember when
using bc programs:

• To exit a bc program, type quit.

• There is a comment convention identical to that of the C 
language. Comments begin with /* and end with */. 

• There is a library of math functions that may be obtained 
by typing at command level: 

bc -l

This command loads a set of library functions that includes
sine (s), cosine (c), arctangent (a), natural logarithm (l),
exponential (e), and Bessel functions of integer order [j(n,x)].
The library sets the scale to 20, but it can be reset to another 
value. 

If you type 

bc file ...

the bc program reads and executes the named file or files
before accepting commands from the keyboard. In this way,
programs and function definitions are loaded.



APPENDIX 8.1



NOTATION 



In the following pages, syntactic categories are in italics and 
literals are in bold. Material in brackets "[]" is optional. 



TOKENS 

Tokens consist of keywords, identifiers, constants, operators, 
and separators. Token separators may be blanks, tabs, or 
comments. Newline characters or semicolons separate 
statements. 

Comments are introduced by the characters /* and terminated 
by */. 

There are three kinds of identifiers— ordinary, array, and 
function. All three types consist of single lowercase letters. 
Array identifiers are followed by square brackets, possibly 
enclosing an expression describing a subscript. Arrays are 
singly dimensioned and may contain up to 2048 elements. 
Indexing begins at zero so an array may be indexed from 0 to
2047. Subscripts are truncated to integers. Function identifiers
are followed by parentheses, possibly enclosing arguments. The 
three types of identifiers do not conflict. A program can have a 
variable named x, an array named x, and a function named x; 
all of which are separate and distinct. 

The following are reserved keywords:

        ibase        if
        obase        break
        scale        define
        sqrt         auto
        length       return
        while        quit
        for





Constants consist of arbitrarily long numbers with an optional 
decimal point. The hexadecimal digits A through F are also 
recognized as digits with values 10 through 15, respectively. 



EXPRESSIONS 

The value of an expression is printed unless the main operator 
is an assignment. Precedence is the same as the order of 
presentation here with highest appearing first. Left or right 
associativity, where applicable, is discussed with each operator. 



Named Expressions 

Named expressions are places where values are stored. Simply 
stated, named expressions are legal on the left side of an 
assignment. The value of a named expression is the value 
stored in the place named. 



identifiers 

Simple identifiers are named expressions. They have an initial 
value of zero. 



array-name[expression]

Array elements are named expressions. They have an initial
value of zero.



scale, ibase, and obase

The internal registers scale, ibase, and obase are all named 
expressions. The scale register is the number of digits after 
the decimal point to be retained in arithmetic operations. It 
has an initial value of zero. The ibase and obase registers 
are the input and output number radix, respectively. Both 
ibase and obase have initial values of ten. 



Function Calls 



function-name([expression[, expression...]])

A function call consists of a function name followed by 
parentheses containing a comma-separated list of expressions, 
which are the function arguments. A whole array passed as an 
argument is specified by the array name followed by empty 
square brackets. All function arguments are passed by value. 
As a result, changes made to the formal parameters have no 
effect on the actual arguments. If the function terminates by 
executing a return statement, the value of the function is the 
value of the expression in the parentheses of the return 
statement or is zero if no expression is provided or if there is 
no return statement. 



sqrt(expression) 

The result is the square root of the expression. The result is 
truncated in the least significant decimal place. The scale of 
the result is the scale of the expression or the value of scale, 
whichever is larger.



length(expression)

The result is the total number of significant decimal digits in 
the expression. The scale of the result is zero. 



scale(expression) 

The result is the scale of the expression. The scale of the result 
is zero. 



Constants 

Constants are primitive expressions. 

Parentheses 

An expression surrounded by parentheses is a primitive 
expression. The parentheses are used to alter the normal 
precedence. 

The unary operators bind right to left. 

-expression 

The result is the negative of the expression. 

++named-expression

The named expression is incremented by one. The result is the
value of the named expression after incrementing.



--named-expression



The named expression is decremented by one. The result is the 
value of the named expression after decrementing. 



named-expression++ 

The named expression is incremented by one. The result is the 
value of the named expression before incrementing. 



named-expression--

The named expression is decremented by one. The result is the 
value of the named expression before decrementing. 



The exponentiation operator binds right to left. 

expression ^ expression

The result is the first expression raised to the power of the 
second expression. The second expression must be an integer. 
If a is the scale of the left expression and b is the absolute 
value of the right expression, then the scale of the result is 

min(a×b, max(scale, a))

The operators *, /, and % bind left to right. 



expression * expression 

The result is the product of the two expressions. If a and b are 
the scales of the two expressions, then the scale of the result is 



min(a+b, max(scale, a, b))



expression / expression

The result is the quotient of the two expressions. The scale of 
the result is the value of scale. 



expression % expression 

The % operator produces the remainder of the division of the 
two expressions. More precisely, a%b is a-a/b*b.



The scale of the result is the sum of the scale of the divisor and 
the value of scale. 

The additive operators bind left to right. 



expression + expression 

The result is the sum of the two expressions. The scale of the 
result is the maximum of the scales of the expressions. 



expression - expression 

The result is the difference of the two expressions. The scale of 
the result is the maximum of the scales of the expressions. 



The assignment operators bind right to left. 



named-expression = expression 

This expression results in assigning the value of the expression 
on the right to the named expression on the left.



named-expression =+ expression
named-expression =- expression
named-expression =* expression
named-expression =/ expression
named-expression =% expression
named-expression =^ expression



The result of the above expressions is equivalent to "named
expression = named expression OP expression", where OP is
the operator after the = sign.



RELATIONAL OPERATORS 

Unlike all other operators, the relational operators are only 
valid as the object of an if or while statement or inside a for 
statement. 



expression < expression 
expression > expression 
expression <= expression 
expression >= expression 
expression == expression 
expression != expression 



STORAGE CLASSES 

There are only two storage classes in bc: global and automatic
(local). Only identifiers that are to be local to a function need 
be declared with the auto command. The arguments to a 
function are local to the function. All other identifiers are 
assumed to be global and available to all functions. All 
identifiers, global and local, have initial values of zero. 
Identifiers declared as auto are allocated on entry to the 
function and released on returning from the function. They 
therefore do not retain values between function calls. The auto 
arrays are specified by the array name followed by empty 
square brackets.

Automatic variables in bc do not work in exactly the same way
as in C language. On entry to a function, the old values of the 
names that appear as parameters and as automatic variables 
are pushed onto a stack. Until return is made from the 
function, reference to these names refers only to the new 
values. 
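
The push-and-restore behavior can be modeled as follows. The
sketch is in Python and is only an illustration of the rule
stated above. The class Names and its methods are hypothetical,
and the function a mirrors the example function of that name from
the chapter text.

```python
class Names:
    """Single-letter variables with save-on-entry, restore-on-return behavior."""

    def __init__(self):
        self.current = {}      # name -> current value (missing names read as 0)
        self.saved = []        # one list of (name, old value) pairs per active call

    def get(self, name):
        return self.current.get(name, 0)

    def set(self, name, value):
        self.current[name] = value

    def enter(self, params, args, autos=()):
        names = list(params) + list(autos)
        self.saved.append([(n, self.get(n)) for n in names])
        for n in autos:
            self.current[n] = 0            # automatic variables start at zero
        for n, v in zip(params, args):
            self.current[n] = v            # parameters receive the argument values

    def leave(self):
        for n, old in reversed(self.saved.pop()):
            self.current[n] = old          # old values come back on return


def a(names, x, y):
    """Model of: define a(x,y){ auto z; z = x*y; return(z) }"""
    names.enter("xy", (x, y), autos="z")
    names.set("z", names.get("x") * names.get("y"))
    result = names.get("z")
    names.leave()
    return result

names = Names()
names.set("z", 99)
product = a(names, 7, 3)
print(product, names.get("z"))
```

While the call is active, any reference to x, y, or z sees the
new values, including references made by other functions called
in the meantime. After the return, the global z holds its old
value again.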



STATEMENTS 

Statements must be separated by a semicolon or newline. 
Except where altered by control statements, execution is 
sequential. 

When a statement is an expression, the value of the expression
is printed, followed by a newline character, unless the main
operator is an assignment.

Statements may be grouped together and used when one 
statement is expected by surrounding them with braces { }. 

The following statement prints the string inside the quotes. 

" any string" 

if (relation) statement

The substatement is executed if the relation is true.

while (relation)statement 

The while statement is executed while the relation is true. 
The test occurs before each execution of the statement. 

for (expression; relation; expression) statement

The for statement is the same as

first-expression
while (relation) {
        statement
        last-expression
}

All three expressions must be present.

break

The break statement causes termination of a for or while
statement.

auto identifier [,identifier] 

The auto statement causes the values of the identifiers to be 
pushed down. The identifiers can be ordinary identifiers or 
array identifiers. Array identifiers are specified by following 
the array name with empty square brackets. The auto 
statement must be the first statement in a function definition. 



define ( [parameter [,parameter... ]]){ 
statements} 

The define statement defines a function. The parameters may 
be ordinary identifiers or array names. Array names must be 
followed by empty square brackets. 

return 

return(expression)

The return statement causes the following:

• Termination of a function

• Popping of the auto variables on the stack

• Specification of the result of the function.

The first form is equivalent to return(0). The result of the
function is the result of the expression in parentheses.

The quit statement stops execution of a bc program and
returns control to the UNIX system software when it is first
encountered. Because it is not treated as an executable
statement, it cannot be used in a function definition or in an if,
for, or while statement.



Chapter 20

INTERACTIVE DESK
CALCULATOR - "dc"



GENERAL 

The dc program is an interactive desk calculator program 
implemented on the UNIX operating system to do arbitrary- 
precision integer arithmetic. It has provisions for manipulating 
scaled fixed-point numbers and for input and output in bases 
other than decimal. 



The size of numbers that can be manipulated by dc is limited
only by available core storage. On typical implementations of
the UNIX system, the size of numbers that can be handled
varies from several hundred digits on the smallest systems to
several thousand digits on the largest.

The dc program works like a stacking calculator using reverse 
Polish notation. Ordinarily, dc operates on decimal integers; 
but an input base, output base, and a number of fractional 
digits to be maintained can be specified. 

A language called bc has been developed which accepts
programs written in the familiar style of higher-level
programming languages and compiles them into output that is
interpreted by dc. Some of the commands described below
were designed for the compiler interface and are not easy for a 
human user to manipulate. 

Numbers that are typed into dc are put on a pushdown stack. 
The dc commands work by taking the top number or two off 
the stack, performing the desired operation, and pushing the 
result on the stack. If an argument is given, input is taken
from that file until its end, then it is taken from the standard
input.



DC COMMANDS 

Any number of commands are permitted on a line. Blanks and 
new-line characters are ignored except within numbers and in 
places where a register name is expected. 

The following constructions are recognized: 

number (e.g. 244) 

The value of a number is pushed onto the stack. A number is
an unbroken string of digits 0 through 9 and uppercase letters
A through F (treated as digits with values 10 through 15,
respectively). The number may be preceded by an underscore
(_) to input a negative number and numbers may contain
decimal points.

The top two values on the stack are added (+), subtracted (-),
multiplied (*), divided (/), remaindered (%), or exponentiated
(^) by using

+ - * / % ^

The two entries are popped off the stack, and the result is 
pushed on the stack in their place. The result of a division is 
an integer truncated toward zero. An exponent must not have 
any digits after the decimal point. 
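
The stack discipline for the arithmetic commands can be sketched
in a few lines of Python. This is a simplified model, not the dc
program: the function rpn is hypothetical, it requires blanks
between all tokens, and it handles only integers and non-negative
exponents.

```python
def trunc_div(a, b):
    """Integer quotient truncated toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def rpn(text):
    """Evaluate blank-separated integers and + - * / % ^; return the final stack."""
    stack = []
    for token in text.split():
        if token in ("+", "-", "*", "/", "%", "^"):
            y = stack.pop()                 # top of stack is the right operand
            x = stack.pop()
            if token == "+":
                stack.append(x + y)
            elif token == "-":
                stack.append(x - y)
            elif token == "*":
                stack.append(x * y)
            elif token == "/":
                stack.append(trunc_div(x, y))
            elif token == "%":
                stack.append(x - trunc_div(x, y) * y)
            else:
                stack.append(x ** y)
        else:
            # a leading underscore marks a negative number
            stack.append(-int(token[1:]) if token.startswith("_") else int(token))
    return stack

print(rpn("142857 285714 +"))
print(rpn("_7 2 /"))
```

Each operator removes two entries and leaves one, so a well-formed
expression ends with a single value on the stack.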



sx 



The top of the main stack is popped and stored in a register
named x (where x may be any character). If s is uppercase, x
is treated as a stack; and the value is pushed onto it. Any
character, even blank or newline, is a valid register name.

lx

The value of register x is pushed onto the stack. Register x is
not altered. If the l is uppercase, register x is treated as a
stack, and its top value is popped onto the main stack. All
registers start with an empty value, which is treated as a zero
by the command l and as an error by the command L.

The following characters perform the stated tasks:

d

The top value on the stack is duplicated.

p

The top value on the stack is printed. The top value remains
unchanged.

f

All values on the stack and in registers are printed.

x

Treats the top element of the stack as a character string,
removes it from the stack, and executes it as a string of dc
commands.

[...]

Puts the bracketed character string onto the top of the stack.

q 

Exits the program. If executing a string, the recursion level is 
popped by two. If q is uppercase, the top value on the stack is 
popped, and the string execution level is popped by that value.

<x >x =x !<x !>x !=x

The top two elements of the stack are popped and compared. 
Register x is executed if they obey the stated relation. 
Exclamation point is negation. 

v

Replaces the top element on the stack by its square root. The
square root of an integer is truncated to an integer.

!

Interprets the rest of the line as a UNIX software command.
Control returns to dc when the command terminates.

c

All values on the stack are popped; the stack becomes empty.

i

The top value on the stack is popped and used as the number
radix for further input. If i is uppercase, the value of the input
base is pushed onto the stack. No mechanism has been
provided for the input of arbitrary numbers in bases less than 1
or greater than 16.

o

The top value on the stack is popped and used as the number
radix for further output. If o is uppercase, the value of the
output base is pushed onto the stack.

k

The top of the stack is popped, and that value is used as a scale
factor that influences the number of decimal places that are
maintained during multiplication, division, and exponentiation.
The scale factor must be greater than or equal to zero and less
than 100. If k is uppercase, the value of the scale factor is
pushed onto the stack.

z

The value of the stack level is pushed onto the stack.

?

A line of input is taken from the input source (usually the
console) and executed.



INTERNAL REPRESENTATION OF NUMBERS

Numbers are stored internally using a dynamic storage 
allocator. Numbers are kept in the form of a string of digits to 
the base 100 stored one digit per byte (centennial digits). The 
string is stored with the low-order digit at the beginning of the 
string. For example, the representation of 157 is 57,1. After 
any arithmetic operation on a number, care is taken that all 
digits are in the range 0 to 99 and that the number has no
leading zeros. The number zero is represented by the empty 
string. 

Negative numbers are represented in the 100s complement 
notation, which is analogous to twos complement notation for 
binary numbers. The high-order digit of a negative number is 
always -1 and all other digits are in the range 0 to 99. The
digit preceding the high-order -1 digit is never a 99. The 
representation of -157 is 43,98,-1. This is called the canonical 
form of a number. The advantage of this kind of 
representation of negative numbers is ease of addition. When 
addition is performed digit by digit, the result is formally 
correct. The result need only be modified, if necessary, to put it 
into canonical form. 
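
A minimal sketch of the canonical form, in Python. The function names to_centennial and from_centennial are hypothetical; the sketch models each centennial digit as a list element rather than a byte and ignores the scale byte.

```python
def to_centennial(n):
    """Return the canonical base-100 digits of integer n, low-order digit first."""
    digits = []
    if n >= 0:
        while n > 0:
            digits.append(n % 100)
            n //= 100
        return digits            # zero is the empty string of digits
    while n != -1:
        digits.append(n % 100)   # Python's % yields a digit in 0..99 here
        n //= 100                # floor division carries the borrow upward
    digits.append(-1)            # high-order digit of a negative number
    return digits


def from_centennial(digits):
    """Recover the integer value from a low-order-first digit list."""
    return sum(d * 100 ** i for i, d in enumerate(digits))
```

With these definitions, to_centennial(157) gives [57, 1] and to_centennial(-157) gives [43, 98, -1], matching the examples in the text.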

Because the largest valid digit is 99 and the byte can hold 
numbers twice that large, addition can be carried out and the 
handling of carries done later when it is convenient. 

An additional byte is stored with each number beyond the 
high-order digit to indicate the number of assumed decimal 
digits after the decimal point. The representation of .001 is 1,3
where the final 3 is the scale and not the high-order digit.
The value of this extra byte is
called the scale factor of the number.



THE ALLOCATOR

The dc program uses a dynamic string storage allocator for all 
of its internal storage. All reading and writing of numbers 
internally is through the allocator. Associated with each string 
in the allocator is a 4-word header containing pointers to the 
beginning of the string, the end of the string, the next place to 
write, and the next place to read. Communication between the 
allocator and dc is via pointers to these headers. 

The allocator initially has one large string on a list of free 
strings. All headers except the one pointing to this string are 
on a list of free headers. Requests for strings are made by size. 
The size of the string actually supplied is the next higher power 
of two. When a request for a string is made, the allocator first 
checks the free list to see if there is a string of the desired size. 
If none is found, the allocator finds the next larger free string 
and splits it repeatedly until it has a string of the right size. 
Leftover strings are put on the free list. If there are no larger 
strings, the allocator tries to combine smaller free strings into 
larger ones. Since all strings are the result of splitting large 
strings, each string has a neighbor that is next to it in core 
and, if free, can be combined with it to make a string twice as 
long. 
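
The splitting and recombining of strings can be sketched as follows. This Python model is an illustration under stated assumptions, not the dc allocator: the names BuddyPool, alloc, and release are hypothetical, the pool size is assumed to be a power of two, a request is rounded up to a power of two, and free neighbors are recombined as soon as a string is released rather than only when a larger string is needed.

```python
def round_up_pow2(n):
    """Smallest power of two that is at least n."""
    size = 1
    while size < n:
        size *= 2
    return size


class BuddyPool:
    def __init__(self, total):
        # One large free string to start with; total is assumed a power of two.
        self.free = {total: [0]}      # size -> offsets of free strings

    def alloc(self, request):
        size = round_up_pow2(request)
        usable = sorted(s for s, offs in self.free.items() if s >= size and offs)
        if not usable:
            return None               # a real allocator would ask for more space
        have = usable[0]
        offset = self.free[have].pop()
        while have > size:            # split repeatedly, freeing the upper halves
            have //= 2
            self.free.setdefault(have, []).append(offset + have)
        return offset, size

    def release(self, offset, size):
        while True:
            neighbor = offset ^ size  # the string this one was split from
            peers = self.free.get(size, [])
            if neighbor not in peers:
                break
            peers.remove(neighbor)    # combine into a string twice as long
            offset = min(offset, neighbor)
            size *= 2
        self.free.setdefault(size, []).append(offset)
```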

If a string of the proper length cannot be found, the allocator 
asks the system for more space. The amount of space on the 
system is the only limitation on the size and number of strings 
in dc. If the allocator runs out of headers at any time in the 
process of trying to allocate a string, it also asks the system for 
more space. 

There are routines in the allocator for reading, writing, 
copying, rewinding, forward spacing, and backspacing strings. 
All string manipulation is done using these routines. 

The reading and writing routines increment the read pointer or 
write pointer so that the characters of a string are read or 
written in succession by a series of read or write calls. The
write pointer is interpreted as the end of the information-containing
portion of a string and a call to read beyond that
point returns an end of string indication. An attempt to write 
beyond the end of a string causes the allocator to allocate a 
larger space and then copy the old string into the larger block. 



INTERNAL ARITHMETIC 

All arithmetic operations are done on integers. The operands 
(or operand) needed for the operation are popped from the 
main stack and their scale factors stripped off. Zeros are added 
or digits removed as necessary to get a properly scaled result 
from the internal arithmetic routine. For example, if the scale 
of the operands is different and decimal alignment is required, 
as it is for addition, zeros are appended to the operand with the 
smaller scale. After performing the required arithmetic 
operation, the proper scale factor is appended to the end of the 
number before it is pushed on the stack. 

A register called scale plays a part in the results of most 
arithmetic operations. The scale register limits the number of 
decimal places retained in arithmetic computations. The scale 
register may be set to the number on the top of the stack 
truncated to an integer with the k command. The K command 
may be used to push the value of scale on the stack. The value 
of scale must be greater than or equal to 0 and less than 100.
The descriptions of the individual arithmetic operations
include the exact effect of scale on the computations.



ADDITION AND SUBTRACTION

The scales of the two numbers are compared and trailing zeros 
are supplied to the number with the lower scale to give both 
numbers the same scale. The number with the smaller scale is 
multiplied by 10 if the difference of the scales is odd. The scale 
of the result is then set to the larger of the scales of the two 
operands. 
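
The alignment of scales can be sketched as follows. This is a hedged Python illustration: a number is modeled as an ordinary integer paired with its scale, and the names add_scaled and sub_scaled are hypothetical. Because the sketch does not use centennial digits, the special handling of an odd difference in scales does not appear.

```python
def add_scaled(a, a_scale, b, b_scale):
    """Add a / 10**a_scale and b / 10**b_scale; return (integer, scale)."""
    scale = max(a_scale, b_scale)
    a *= 10 ** (scale - a_scale)   # supply trailing zeros to the lower scale
    b *= 10 ** (scale - b_scale)
    return a + b, scale


def sub_scaled(a, a_scale, b, b_scale):
    """Subtract by negating the second operand and adding."""
    return add_scaled(a, a_scale, -b, b_scale)
```

Adding 1.5 and 3.517, written as add_scaled(15, 1, 3517, 3), gives (5017, 3), that is, 5.017.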



Subtraction is performed by negating the number to be 
subtracted and proceeding as in addition. 

The addition is performed digit by digit from the low-order end 
of the number. The carries are propagated in the usual way. 
The resulting number is brought into canonical form, which 
may require stripping of leading zeros, or for negative numbers, 
replacing the high-order configuration 99,-1 by the digit -1. In 
any case, digits that are not in the range 0 through 99 must be
brought into that range, propagating any carries or borrows 
that result. 



MULTIPLICATION 

The scales are removed from the two operands and saved. The 
operands are both made positive. Then multiplication is 
performed in a digit by digit manner that exactly follows the 
hand method of multiplying. The first number is multiplied by 
each digit of the second number, beginning with its low-order 
digit. The intermediate products are accumulated into a partial 
sum which becomes the final product. The product is put into 
the canonical form and its sign is computed from the signs of 
the original operands. 
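
The hand method on centennial digits can be sketched as follows. This Python illustration assumes both operands have already been made positive and stripped of their scales; the names mul_centennial, digits_of, and value_of are hypothetical.

```python
def digits_of(n):
    """Base-100 digits of a non-negative integer, low-order digit first."""
    digits = []
    while n > 0:
        digits.append(n % 100)
        n //= 100
    return digits


def value_of(digits):
    """Integer value of a low-order-first base-100 digit list."""
    return sum(d * 100 ** i for i, d in enumerate(digits))


def mul_centennial(a, b):
    """Multiply two non-negative numbers held as base-100 digit lists."""
    product = [0] * (len(a) + len(b))
    for j, bd in enumerate(b):             # each digit of the second number
        carry = 0
        for i, ad in enumerate(a):         # times the whole first number
            total = product[i + j] + ad * bd + carry
            product[i + j] = total % 100   # accumulate into the partial sum
            carry = total // 100
        product[j + len(a)] += carry
    while product and product[-1] == 0:    # canonical form: no leading zeros
        product.pop()
    return product
```

For example, 157 times 23 is mul_centennial([57, 1], [23]), which yields [11, 36], the representation of 3611.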

The scale of the result is set equal to the sum of the scales of
the two operands. If that sum is larger than the internal
register scale and also larger than both of the scales of the
two operands, then the scale of the result is set equal to the
largest of these last three quantities.



DIVISION 

The scales are removed from the two operands. Zeros are 
appended, or digits are removed from the dividend to make the 
scale of the result of the integer division equal to the internal 
quantity scale. The signs are removed and saved. 

Division is performed much as it would be done by hand. The 
difference of the lengths of the two numbers is computed. If 
the divisor is longer than the dividend, zero is returned. 
Otherwise, the top digit of the divisor is divided into the top 
two digits of the dividend. The result is used as the first 
(high-order) digit of the quotient. If it turns out to be one unit 
too low, the next trial quotient is larger than 99; and this is 
adjusted at the end of the process. The trial digit is multiplied 
by the divisor, the result subtracted from the dividend, and the 
process is repeated to get additional quotient digits until the 
remaining dividend is smaller than the divisor. At the end, the 
digits of the quotient are put into the canonical form with 
propagation of carry as needed. The sign is set from the sign of 
the operands. 
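
The effect of scale on division can be sketched as follows. This Python illustration works on ordinary integers paired with scales rather than on centennial digits, so it does not show the trial-digit procedure; the names trunc_div and div_scaled are hypothetical. The remainder is computed with the scale rule given under REMAINDER, so that the dividend can be recreated from the quotient and remainder. Division by zero is not handled.

```python
def trunc_div(n, d):
    """Integer division truncated toward zero."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d >= 0) else -q


def div_scaled(a, a_scale, b, b_scale, scale_reg):
    """Return (quotient, quotient_scale, remainder, remainder_scale)."""
    shift = scale_reg + b_scale - a_scale
    if shift >= 0:
        q = trunc_div(a * 10 ** shift, b)       # zeros appended to the dividend
    else:
        q = trunc_div(a, b * 10 ** (-shift))    # same as removing dividend digits
    rem_scale = max(a_scale, scale_reg + b_scale)
    rem = (a * 10 ** (rem_scale - a_scale)
           - q * b * 10 ** (rem_scale - scale_reg - b_scale))
    return q, scale_reg, rem, rem_scale
```

Dividing 7 by 3 with scale set to 2, div_scaled(7, 0, 3, 0, 2), gives the quotient 2.33 and the remainder .01, and 2.33 times 3 plus .01 recreates 7.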



REMAINDER 

The division routine is called, and division is performed exactly 
as described. The quantity returned is the remains of the 
dividend at the end of the divide process. Since division 
truncates toward zero, remainders have the same sign as the 
dividend. The scale of the remainder is set to the maximum of 
the scale of the dividend and the scale of the quotient plus the 
scale of the divisor.



SQUARE ROOT

The scale is removed from the operand. Zeros are added if 
necessary to make the integer result have a scale that is the 
larger of the internal quantity scale and the scale of the 
operand. The method used to compute the square root is
Newton's method with successive approximations by the rule

x_{n+1} = \frac{1}{2} (x_n + y / x_n)

where y is the operand.

The initial guess is found by taking the integer square root of 
the top two digits. 
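
Newton's method on integers can be sketched as follows. This Python illustration is not the dc routine: the names isqrt_newton and sqrt_scaled are hypothetical, and for simplicity the starting guess is the operand itself rather than the root of the top two digits.

```python
def isqrt_newton(y):
    """Integer square root of y >= 0 by Newton's method."""
    if y < 2:
        return y
    x = y                          # any guess at or above the root will do
    while True:
        nxt = (x + y // x) // 2    # average the guess with y divided by the guess
        if nxt >= x:               # no further improvement: x is the root
            return x
        x = nxt


def sqrt_scaled(value, scale, scale_reg):
    """Square root of value / 10**scale; return (integer, scale)."""
    result_scale = max(scale_reg, scale)
    # The integer under the root needs twice as many decimal places as the result.
    widened = value * 10 ** (2 * result_scale - scale)
    return isqrt_newton(widened), result_scale
```

With scale set to 4, sqrt_scaled(2, 0, 4) gives (14142, 4), that is, 1.4142.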



EXPONENTIATION 

Only exponents with scale factor 0 are handled. If the exponent
is 0, then the result is 1. If the exponent is negative, then it is 
made positive; and the base is divided into 1. The scale of the 
base is removed. 



The integer exponent is viewed as a binary number. The base 
is repeatedly squared, and the result is obtained as a product of 
those powers of the base that correspond to the positions of the 
one-bits in the binary representation of the exponent. Enough 
digits of the result are removed to make the scale of the result 
the same as if the indicated multiplication had been performed. 
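
The use of the binary representation of the exponent can be sketched as follows. This Python illustration handles integer bases and non-negative exponents only and leaves out the scale adjustment; the name power is hypothetical.

```python
def power(base, exponent):
    """Raise base to a non-negative integer exponent by repeated squaring."""
    if exponent < 0:
        raise ValueError("this sketch handles non-negative exponents only")
    result = 1
    square = base                 # base ** (2 ** k) for the current bit k
    while exponent:
        if exponent & 1:          # a one-bit: this power belongs in the product
            result *= square
        square *= square
        exponent >>= 1
    return result
```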



INPUT CONVERSION AND BASE 

Numbers are converted to the internal representation as they 
are read in. The scale stored with a number is simply the 
number of fractional digits input. Negative numbers are
indicated by preceding the number with an underscore (_).

The hexadecimal digits A through F correspond to the numbers
10 through 15 regardless of input base. The i command can be
used to change the base of the input numbers. This command 
pops the stack, truncates the resulting number to an integer, 
and uses it as the input base for all further input. The input 
base (ibase) is initialized to 10 (decimal) but may, for example, 
be changed to 8 or 16 for octal or hexadecimal to decimal 
conversions. The command I pushes the value of the input base 
on the stack. 
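
Input conversion of whole numbers can be sketched as follows. This Python illustration is an assumption about how the digits combine, not a transcription of dc: the name read_integer is hypothetical, and fractional digits are not handled.

```python
DIGITS = "0123456789ABCDEF"


def read_integer(text, ibase=10):
    """Convert text such as '_1A' to an integer using input base ibase."""
    negative = text.startswith("_")
    if negative:
        text = text[1:]
    value = 0
    for ch in text:
        # A through F count as 10 through 15 whatever the input base is.
        value = value * ibase + DIGITS.index(ch)
    return -value if negative else value
```

With the input base set to 16, read_integer("FF", 16) gives 255; with the input base set to 8, read_integer("777", 8) gives 511.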



OUTPUT COMMANDS 

The command p causes the top of the stack to be printed. It 
does not remove the top of the stack. All of the stack and 
internal registers are output by typing the command f. The o 
command is used to change the output base (obase). This 
command uses the top of the stack truncated to an integer as 
the base for all further output. The output base is initialized
to 10 (decimal). It works correctly for any base. The command 
O pushes the value of the output base on the stack. 
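
Output conversion to an arbitrary base can be sketched as follows. This Python illustration returns the digits as a list of integers and leaves their printed form aside; the name to_base is hypothetical.

```python
def to_base(n, obase):
    """Digits of non-negative integer n in base obase, high-order digit first."""
    if n == 0:
        return [0]
    digits = []
    while n:
        digits.append(n % obase)   # peel off the low-order digit
        n //= obase
    return digits[::-1]
```

For example, to_base(511, 8) gives [7, 7, 7], and a large base such as 100000 splits 1234567890 into the groups [12345, 67890].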



OUTPUT FORMAT AND BASE 

The input and output bases only affect the interpretation of 
numbers on input and output; they have no effect on arithmetic 
computations. Large numbers are output with 70 characters 
per line; a backslash (\) indicates a continued line. All choices 
of input and output bases work correctly, although not all are 
useful. A particularly useful output base is 100000, which has 
the effect of grouping digits in fives. Bases of 8 and 16 are used 
for decimal-octal or decimal-hexadecimal conversions.



INTERNAL REGISTERS

Numbers or strings may be stored in internal registers or
loaded on the stack from registers with the commands s and l.
The command sx pops the top of the stack and stores the result
in register x. The x can be any character. The command lx
puts the contents of register x on the top of the stack. The l
command has no effect on the contents of register x. The s
command, however, is destructive.



STACK COMMANDS 

The command c clears the stack. The command d pushes a 
duplicate of the number on the top of the stack onto the stack. 
The command z pushes the stack size on the stack. The 
command X replaces the number on the top of the stack with 
its scale factor. The command Z replaces the top of the stack 
with its length. 



SUBROUTINE DEFINITIONS AND CALLS 

Enclosing a string in brackets "[]" pushes the ASCII string on 
the stack. The q command quits or (in executing a string) pops 
the recursion levels by two. 



INTERNAL REGISTERS: PROGRAMMING DC

The load and store commands, together with "[]" to store 
strings, the x command to execute, and the testing commands 
(<, >, =, !<, !>, !=), can be used to program dc. The x 
command assumes the top of the stack is a string of dc 
commands and executes it. The testing commands compare the
top two elements on the stack and, if the relation holds, execute
the register that follows the relation. For example, to print the
numbers 0 through 9,



[lip1+ si li10>a]sa
0si lax



PUSHDOWN REGISTERS AND ARRAYS 

These commands are designed for use by a compiler, not 
directly by programmers. They involve pushdown registers and 
arrays. In addition to the stack that commands work on, dc 
can be thought of as having individual stacks for each register. 
These registers are operated on by the commands S and L. Sx 
pushes the top value of the main stack onto the stack for the 
register x. Lx pops the stack for register x and puts the result
on the main stack. The commands s and l also work on
registers but not as pushdown stacks. The command l does not
affect the top of the register stack, but s destroys what was 
there before. 
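
The difference between the two pairs of commands can be sketched as follows. This Python model is an illustration only: the class name Registers is hypothetical, and each method stands for the dc command of the same letter, taking the value from (or returning it to) the caller instead of the main stack.

```python
class Registers:
    def __init__(self):
        self.stacks = {}   # register name -> list, top of the stack last

    def s(self, name, value):
        stack = self.stacks.setdefault(name, [])
        if stack:
            stack[-1] = value      # s destroys what was there before
        else:
            stack.append(value)

    def l(self, name):
        stack = self.stacks.get(name)
        return stack[-1] if stack else 0   # an empty register reads as zero

    def S(self, name, value):
        self.stacks.setdefault(name, []).append(value)   # push

    def L(self, name):
        stack = self.stacks.get(name)
        if not stack:
            raise IndexError("L on an empty register is an error")
        return stack.pop()                 # pop
```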

The commands to work on arrays are : and ;. The command :x 
pops the stack and uses this value as an index into the array x. 
The next element on the stack is stored at this index in x. An 
index must be greater than or equal to 0 and less than 2048.
The command ;x loads the main stack from the array x. The
value on the top of the stack is the index into the array x of the
value to be loaded.



MISCELLANEOUS COMMANDS

The command ! interprets the rest of the line as a UNIX 
software command and passes it to the UNIX operating system 
to execute. One other compiler command is Q. This command 
uses the top of the stack as the number of levels of recursion to 
skip. 



DESIGN CHOICES 

The real reason for the use of a dynamic storage allocator is 
that a general purpose program can be used for a variety of 
other tasks. The allocator has some value for input and for 
compiling (i.e., the bracket [...] commands) where it cannot be 
known in advance how long a string will be. The result is that,
at a modest cost in execution time:



• All considerations of string allocation and sizes of strings
are removed from the remainder of the program.

• Debugging is made easier.

The allocation method used wastes approximately 25
percent of available space.

The choice of 100 as a base for internal arithmetic seemingly
has no compelling advantage. Yet the base cannot exceed 127
because of hardware limitations, and at the cost of 5 percent in
space, debugging was made a great deal easier and decimal
output was made much faster.

The reason for a stack-type arithmetic design was to permit all 
dc commands from addition to subroutine execution to be 
implemented in essentially the same way. The result was a 
considerable degree of logical separation of the final program 
into modules with very little communication between modules.

The rationale for the lack of interaction between the scale and
the bases is to provide an understandable means of proceeding 
after a change of base or scale (when numbers had already been 
entered). An earlier implementation which had global notions 
of scale and base did not work out well. If the value of scale is 
interpreted in the current input or output base, then a change 
of base or scale in the midst of a computation causes great 
confusion in the interpretation of the results. The current 
scheme has the advantage that the value of the input and 
output bases are only used for input and output, respectively, 
and they are ignored in all other operations. The value of scale 
is not used for any essential purpose by any part of the 
program. It is used only to prevent the number of decimal 
places resulting from the arithmetic operations from growing 
beyond all bounds. 

The rationale for the choices for the scales of the results of 
arithmetic is that in no case should any significant digits be 
thrown away if, on appearances, the user actually wanted them. 
Thus, if the user wants to add the numbers 1.5 and 3.517, it
seemed reasonable to give them the result 5.017 without
requiring them to specify rather obvious requirements
for precision.

On the other hand, multiplication and exponentiation produce 
results with many more digits than their operands. It seemed 
reasonable to give as a minimum the number of decimal places 
in the operands but not to give more than that number of digits 
unless the user asked for them by specifying a value for scale. 
Square root can be handled in just the same way as 
multiplication. The operation of division gives arbitrarily many 
decimal places, and there is simply no way to guess how many 
places the user wants. In this case only, the user must specify a 
scale to get any decimal places at all. 
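
The scale rule for multiplication can be sketched as follows. This Python illustration models a number as an integer paired with its scale; the names trunc_to_scale and mul_scaled are hypothetical, and scale_reg stands for the value of scale.

```python
def trunc_to_scale(value, scale, new_scale):
    """Drop low-order decimal digits, toward zero, to reach new_scale."""
    if new_scale >= scale:
        return value * 10 ** (new_scale - scale)
    divisor = 10 ** (scale - new_scale)
    magnitude = abs(value) // divisor
    return magnitude if value >= 0 else -magnitude


def mul_scaled(a, a_scale, b, b_scale, scale_reg):
    """Multiply a / 10**a_scale by b / 10**b_scale; return (integer, scale)."""
    full = a_scale + b_scale
    if full > scale_reg and full > a_scale and full > b_scale:
        result_scale = max(scale_reg, a_scale, b_scale)
    else:
        result_scale = full
    return trunc_to_scale(a * b, full, result_scale), result_scale
```

With scale set to 0, multiplying 1.5 by 1.25 as mul_scaled(15, 1, 125, 2, 0) gives (187, 2), that is, 1.87; with scale set to 5 the full product 1.875 is kept.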

The scale of remainder was chosen to make it possible to 
recreate the dividend from the quotient and remainder. This is 
easy to implement; no digits are thrown away.



Chapter 21



LEXICAL ANALYZER GENERATOR— 

"lex" 



GENERAL 

The lex program generator produces a program, in a
general purpose language, that recognizes regular expressions.
It is designed for lexical processing of character input streams. 
It accepts a high-level, problem oriented specification for 
character string matching. The regular expressions are 
specified by you (the user) in the source specifications given to 
lex. The lex program generator source is a table of regular 
expressions and corresponding program fragments. The table 
is translated to a program that reads an input stream, copies 
the input stream to an output stream, and partitions the input 
into strings that match the given expressions. As each such 
string is recognized, the corresponding program fragment is 
executed. The recognition of the expressions is performed by a 
deterministic finite automaton generated by lex. The program 
fragments written by you are executed in the order in which 
the corresponding regular expressions occur in the input 
stream. 

The user supplies the additional code beyond expression 
matching needed to complete the tasks, possibly including code
written by other generators. The program that recognizes the 
expressions is generated in the general purpose programming 
language employed for your program fragments. Thus, a high- 
level expression language is provided to write the string 
expressions to be matched while your freedom to write actions 
is unimpaired. 

The lex program generator is not a complete language, but rather a
generator representing a new language feature which can be
added to different programming languages, called "host
languages". Just as general purpose languages can produce
code to run on different computer hardware, lex can write code 
in different host languages. The host language is used for the 
output code generated by lex and also for the program 
fragments added by the user. Compatible run-time libraries for 
the different host languages are also provided. This makes lex 
adaptable to different environments and different users. Each 
application may be directed to the combination of hardware 
and host language appropriate to the task, the user's 
background, and the properties of local implementations. At 
present, the only supported host language is the C language, 
although Fortran (in the form of Ratfor) has been available in 
the past. The lex generator exists on the UNIX operating 
system, but the code generated by lex may be taken anywhere
the appropriate compilers exist. 

The lex program generator turns the user's expressions and 
actions (called source) into the host general purpose language; 
the generated program is named yylex. The yylex program 
recognizes expressions in a stream (called input) and performs 
the specified actions for each expression as it is detected. See 
Figure 21-1. 



Source --> lex --> yylex

Input --> yylex --> Output



Figure 21-1. Overview of lex

For example, consider a program to delete from the input all
blanks or tabs at the ends of lines. 



%% 

[ \t]+$ ; 

is all that is required. The program contains a %% delimiter 
to mark the beginning of the rules. This rule contains a regular 
expression that matches one or more instances of the 
characters blank or tab (written \t for visibility, in accordance
with the C language convention) and occurs prior to the end of 
a line. The brackets indicate the character class made of blank 
and tab; the + indicates "one or more ..."; and the $ indicates 
"end of line," as in QED. No action is specified, so the 
program generated by lex, yylex(), ignores these characters.
Everything else is copied. To change any remaining string of 
blanks or tabs to a single blank, add another rule. 

%% 

[ \t]+$ ; 

[ \t]+ printf(" ");

The coded instructions (generated for this source) scan for both 
rules at once, observe (at the termination of the string of 
blanks or tabs) whether or not there is a newline character, and 
then execute the desired rule action. The first rule matches all 
strings of blanks or tabs at the end of lines, and the second rule 
matches all remaining strings of blanks or tabs. 
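
The combined effect of the two rules can be imitated as follows. This Python sketch only reproduces the result on a block of text; it makes two passes with the re module, whereas the program generated by lex scans for both rules at once. The function name squeeze_blanks is hypothetical.

```python
import re


def squeeze_blanks(text):
    """Imitate the two whitespace rules on a block of text."""
    # Rule 1: [ \t]+$ with an empty action deletes blanks and tabs at line ends.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Rule 2: [ \t]+ replaces each remaining run with a single blank.
    return re.sub(r"[ \t]+", " ", text)
```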

The lex program generator can be used alone for simple 
transformations or for analysis and statistics gathering on a 
lexical level. The lex generator can also be used with a parser 
generator to perform the lexical analysis phase; it is 
particularly easy to interface lex and yacc. The lex program 
recognizes only regular expressions; yacc writes parsers that
accept a large class of context free grammars but requires a
lower level analyzer to recognize input tokens. Thus, a 
combination of lex and yacc is often appropriate. When used 
as a preprocessor for a later parser generator, lex is used to 
partition the input stream; and the parser generator assigns 
structure to the resulting pieces. The flow of control in such a 
case is shown in Figure 21-2.



Input --> yylex --> yyparse --> Parsed input



Figure 21-2. Lex With Yacc

Additional programs, written by other generators or by hand, 
can be added easily to programs written by lex. Note that
yylex is the name yacc expects its lexical
analyzer to have, so the use of this name by lex
simplifies interfacing.

In the program written by lex, the user's fragments 
(representing the actions to be performed as each regular 
expression is found) are gathered as cases of a switch. The 
automaton interpreter directs the control flow. Opportunity is 
provided for the user to insert either declarations or additional 
statements in the routine containing the actions or to add 
subroutines outside this action routine. 



The lex program generator is not limited to a source that can 
be interpreted on the basis of one character look-ahead. For 
example, if there are two rules, one looking for "ab" and 
another for "abcdefg" and the input stream is "abcdefh," lex 
recognizes "ab" and leaves the input pointer just before "cd ...".
Such backup is more costly than the processing of simpler
languages. 
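
The backup can be sketched as follows. This Python illustration handles literal strings only and is not the automaton that lex generates; the name scan is hypothetical. The scanner reads ahead as long as some rule might still match, remembers the last point at which a complete rule was seen, and returns to that point when the longer rule fails.

```python
def scan(literals, text, pos=0):
    """Return (matched_text, next_pos) for the longest literal found at pos."""
    last_accept = None            # end of the longest complete match so far
    candidates = set(literals)
    i = pos
    while candidates and i < len(text):
        ch = text[i]
        # Keep only the literals that still agree with the input here.
        candidates = {lit for lit in candidates
                      if i - pos < len(lit) and lit[i - pos] == ch}
        i += 1
        if any(len(lit) == i - pos for lit in candidates):
            last_accept = i
    if last_accept is None:
        return None, pos
    return text[pos:last_accept], last_accept   # back up to the accepted point
```

With the rules "ab" and "abcdefg" and the input "abcdefh", scan reads as far as the "h" before giving up on the longer rule, then returns ("ab", 2), leaving "cdefh" to be scanned next.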



lex SOURCE 

The general format of lex source is 

{definitions} 

%% 

{rules} 

%% 

{user subroutines} 

where the definitions and the user subroutines are often 
omitted. The first %% is required to mark the beginning of 
the rules, but the second %% is optional. The absolute 
minimum lex program is

%% 

(no definitions, no rules) which translates into a program that 
copies the input to the output unchanged. 

In the outline of lex programs shown above, the rules represent 
your control decisions. They are in a table containing 

• A left column with regular expressions 

• A right column with actions and program fragments to be 
executed when the expressions are recognized. 

Thus an individual rule might be 

integer printf("found keyword INT");

to look for the string integer in the input stream and print the
message "found keyword INT" whenever it appears. In this
example, the host procedural language is C, and the C language 
library function printf is used to print the string. The end of 
the expression is indicated by the first blank or tab character. 
If the action is merely a single C language expression, it can 
just be given on the right side of the line; if it is compound or 
takes more than a line, it should be enclosed in braces. As a 
more useful example, suppose you desire to change a number of 
words from British to American spelling. The lex rules such 
as: 

colour printf("color");

mechanise printf("mechanize");
petrol printf("gas");

would be a start. These rules are not sufficient since the word
"petroleum" would become "gaseum".



lex REGULAR EXPRESSIONS 

The definitions of regular expressions are very similar to those 
in QED. A regular expression specifies a set of strings to be 
matched. It contains text characters (which match the 
corresponding characters in the strings being compared) and 
operator characters (which specify repetitions, choices, and 
other features). The letters of the alphabet and the digits are 
always text characters; the regular expression 

integer 

matches the string "integer" wherever it appears, and the 
expression 



a57D

looks for the string "a57D".

Operators 

The operator characters are

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should 
be used. The quotation mark operator " indicates that 
whatever is contained between a pair of quotes is to be taken as 
text characters. Thus: 

xyz"++"

matches the string xyz++ when it appears. Note that a part 
of a string may be quoted. It is harmless, but unnecessary, to 
quote an ordinary text character; the expression

"xyz++"

is equivalent to the one above. Thus, by quoting every 
nonalphanumeric character being used as a text character, the 
user can avoid remembering the list above of current operator 
characters and is safe should further extensions to lex 
lengthen the list. 

An operator character may also be turned into a text character 
by preceding it with a backslash (\) as in 

xyz\+\+ 

which is another, less readable, equivalent of the above 
expressions. Another use of the quoting mechanism is to get a 
blank into an expression; normally, as explained above, blanks
or tabs end a rule. Any blank character not contained within []
(see below) must be quoted. Several normal C language escapes
with \ are recognized: \n is newline, \t is tab, and \b is
backspace. To enter \ itself, use \\. Since newline is illegal in
an expression, \n must be used; it is not required to escape tab 
and backspace. Every character except blank, tab, newline, and 
the list of operator characters above is always a text character. 



Character Classes 

Classes of characters can be specified using the operator pair []. 
The construction [abc] matches a single character which may 
be "a", "b", or "c". Within square brackets, most operator 
meanings are ignored. Only three characters are special; these
are \, -, and ^. The - character indicates ranges. For
example, 

[a-z0-9<>_] 

indicates the character class containing all the lowercase 
letters, the digits, the angle brackets, and underline. Ranges 
may be given in either order. Using - between any pair of 
characters which are not both uppercase letters, both lowercase 
letters, or both digits is implementation dependent and gets a 
warning message (e.g., [0-z] in ASCII is many more characters
than in EBCDIC). If it is desired to include the character - in 
a character class, it should be first or last; thus: 

[-+0-9] 

matches all the digits and the two signs. 

In character classes, the ^ operator must appear as the first
character after the left bracket to indicate that the resulting
string is complemented with respect to the computer character
set. Thus:

[^abc]

matches all characters except "a", "b", or "c", including all
special or control characters; or

[^a-zA-Z]

is any character that is not a letter. The \ character provides 
the usual escapes within character class brackets. 
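
For the simple classes shown here, the notation of the Python re module behaves the same way, which gives a quick way to try a class out. The sketch below is an illustration in Python, not lex, and the helper name chars_matching is hypothetical.

```python
import re


def chars_matching(char_class, candidates):
    """Return the characters of candidates that the class matches."""
    return "".join(ch for ch in candidates if re.fullmatch(char_class, ch))
```

For example, chars_matching(r"[-+0-9]", "-+7x") keeps the two signs and the digit, and chars_matching(r"[^abc]", "abcd\n") keeps "d" and the newline.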



Arbitrary Character 

To match almost any character, the operator character

.

(dot) is the class of all characters except newline. Escaping into
octal is possible although nonportable.

[\40-\176]

matches all printable ASCII characters from octal 40 (blank) to 
octal 176 (tilde). 



Optional Expressions 

The operator ? indicates an optional element of an expression. 
Thus: 



ab?c 

matches either "ac" or "abc". 



Repeated Expressions 

Repetitions of classes are indicated by the operators * and +. 
For example, 

a* 

is any number of consecutive "a" characters, including zero; 
while 



a+ 

is one or more instances of "a". For example, 

[a-z]+ 

is all strings of lowercase letters, and 

[A-Za-z][A-Za-z0-9]* 

indicates all alphanumeric strings with a leading alphabetic 
character. This is a typical expression for recognizing 
identifiers in computer languages. 
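
As a rough illustration of what the identifier expression accepts, the following C sketch returns the length of the longest prefix matching [A-Za-z][A-Za-z0-9]*. The name match_identifier is hypothetical, and the code assumes the default "C" locale so that isalpha and isalnum mean exactly the ASCII letters and digits.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching [A-Za-z][A-Za-z0-9]*,
 * or 0 if s does not start with a letter.  Hypothetical helper. */
size_t match_identifier(const char *s)
{
    size_t n;

    if (!isalpha((unsigned char)s[0]))
        return 0;                   /* the leading letter is mandatory */
    n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;                        /* the * part: zero or more repeats */
    return n;
}
```

Given "x25+y" the function reports 3 characters; given "9lives" it reports 0 because the first character is a digit.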

Alternation and Grouping 

The operator | indicates alternation. For example, 

(ab|cd) 

matches either "ab" or "cd". Note that parentheses are used 
for grouping, although they are not necessary on the outside 
level. For example, 

ab|cd 

would have sufficed. Parentheses can be used for more complex 
expressions: 

(ab|cd+)?(ef)* 

matches such strings as "abefef", "efefef", "cdef", or "cddd"; 
but not "abc", "abcd", or "abcdef". 
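
The grouping example can be checked by hand with a small C function that tests whether a whole string matches (ab|cd+)?(ef)*. This is a hand-written sketch for this one expression only; matches_pattern is a hypothetical name, and lex itself builds a table-driven automaton rather than code like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the whole of s matches (ab|cd+)?(ef)* .
 * Hypothetical helper written for this single expression. */
bool matches_pattern(const char *s)
{
    size_t i = 0;

    /* Optional group: either "ab", or "c" followed by one or more "d". */
    if (s[0] == 'a' && s[1] == 'b') {
        i = 2;
    } else if (s[0] == 'c' && s[1] == 'd') {
        i = 2;
        while (s[i] == 'd')
            i++;
    }
    /* Zero or more repetitions of "ef". */
    while (s[i] == 'e' && s[i + 1] == 'f')
        i += 2;

    return s[i] == '\0';            /* nothing may be left over */
}
```

The function accepts "abefef", "efefef", "cdef", and "cddd", and rejects "abc", "abcd", and "abcdef", in agreement with the text.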



Context Sensitivity 

The lex program recognizes a small amount of surrounding 
context. The two simplest operators for this are ^ and $. If the 
first character of an expression is ^, the expression is only 
matched at the beginning of a line (after a newline character or 
at the beginning of the input stream). This never conflicts with 
the other meaning of ^ (complementation of character classes) 
since that only applies within the [] operators. If the very last 
character is $, the expression is only matched at the end of a 
line (when immediately followed by newline). The latter 
operator is a special case of the / operator character which 
indicates trailing context. The expression 

ab/cd 

matches the string "ab" but only if followed by "cd". Thus: 

ab$ 

is the same as 

ab/\n 

Left context is handled in lex by "start conditions" as 
explained later. If a rule is only to be executed when the lex 
automaton interpreter is in start condition x, the rule should be 
prefixed by 



<x> 



using the angle bracket operator characters. If we considered 
"being at the beginning of a line" to be start condition ONE, 
then the ^ operator would be equivalent to 

<ONE> 

Start conditions are explained more fully later. 

Repetitions and Definitions 

The operators {} specify either repetitions (if they enclose 
numbers) or definition expansion (if they enclose a name). For 
example, 

{digit} 

looks for a predefined string named "digit" and inserts it at 
that point in the expression. The definitions are given in the 
first part of the lex input before the rules. In contrast, 

a{1,5} 

looks for 1 to 5 occurrences of "a". 

Finally, initial % is special, being the separator for lex source 
segments. 



lex ACTIONS 

When an expression written as above is matched, lex executes 
the corresponding action. This part describes some features of 
lex that aid in writing actions. Note that there is a default 
action that consists of copying the input to the output. This is 
performed on all strings not otherwise matched. Thus, the lex 
user who wishes to absorb the entire input, without producing 
any output, must provide rules to match everything. When lex 
is being used with yacc, this is the normal situation. One may 
consider that actions are what is done instead of copying the 
input to the output; thus, in general, a rule that merely copies 
can be omitted. Also, a character combination that is omitted 
from the rules and that appears as input is likely to be printed 
on the output, thus calling attention to the gap in the rules. 

One of the simplest things that can be done is to ignore the 
input. Specifying a C language null statement, ; as an action 
causes this result. A frequent rule is 

[ \t\n] ; 

which causes the three spacing characters (blank, tab, and 
newline) to be ignored. 

Another easy way to avoid writing actions is the action 
character | which indicates that the action for this rule is the 
action for the next rule. The previous example could also have 
been written 

" "     | 
"\t"    | 
"\n"    ; 

with the same result although in different style. The quotes 
around \n and \t are not required. 

In more complex actions, you may often want to know the 
actual text that matched some expression like "[a-z]+". The 
lex program leaves this text in an external character array. 
Thus, to print the name found, a rule like 

[a-z]+    printf("%s", yytext); 

prints the string in yytext[]. The C language function printf 
accepts a format argument and data to be printed; in this case, 
the format is "print string" (% indicating data conversion, and 
s indicating string type), and the data are the characters in 
yytext[]. This places the matched string on the output. This 
action is so common that it may be written as ECHO: 

[a-z]+ ECHO; 

is the same as the above. Since the default action is just to 
print the characters found, one might ask why give a rule like 
this one which merely specifies the default action. Such rules 
are often required to avoid matching some other rule that is 
not desired. For example, if there is a rule that matches read, 
it normally matches the instances of read contained in bread 
or readjust. To avoid this, a rule of the form "[a-z]+" is 
needed. This is explained further below. 

Sometimes it is more convenient to know the end of what has 
been found; hence, lex also provides a count yyleng of the 
number of characters matched. To count both the number of 
words and the number of characters in words in the input, 
write 

[a-zA-Z]+    {words++; chars += yyleng;} 

which accumulates in chars the number of characters in the 
words recognized. The last character in the string matched can 
be accessed by 

yytext[yyleng-1] 

Occasionally, a lex action may decide that a rule has not 
recognized the correct span of characters. Two routines are 
provided to aid with this situation. First, yymore() can be 
called to indicate that the next input expression recognized is to 
be tacked on to the end of this input. Normally, the next input 
string would overwrite the current entry in yytext. Second, 
yyless(n) may be called to indicate that not all the characters 
matched by the currently successful expression are wanted 
right now. The argument "n" indicates the number of 
characters in yytext to be retained. Further characters 
previously matched are returned to the input. This provides 
the same sort of look ahead offered by the / operator but in a 
different form. 

Example: 

Consider a language that defines a string as a set of characters 
between quotation (") marks and provides that to include a (") 
in a string it must be preceded by a \. The regular expression 
which matches that is somewhat confusing, so that it might be 
preferable to write 

\"[^"]*    { 
           if (yytext[yyleng-1] == '\\') 
                yymore(); 
           else 
                ... normal user processing 
           } 

which will, when faced with a string such as "abc\"def", first match 
the five characters "abc\; then the call to yymore() will cause 
the next part of the string, "def, to be tacked on the end. Note 
that the final quote terminating the string should be picked up 
in the code labeled "normal processing". 
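
The effect of yymore() in this example can be imitated in plain C. The sketch below is an assumption-laden illustration, not lex output: scan_string is a hypothetical function, out plays the role of yytext, and len plays the role of yyleng. Each pass of the outer loop corresponds to one match of the rule; when the text so far ends in a backslash, the next match is appended instead of overwriting.

```c
#include <stddef.h>

/* Scan a quoted string starting at s[0] == '"'.  On success, copy the
 * token text (opening quote through the last character before the
 * closing quote) into out and return the number of input characters
 * consumed, including the closing quote.  Return 0 if s does not hold
 * a complete string or out is too small.  Hypothetical helper. */
size_t scan_string(const char *s, char *out, size_t outsz)
{
    size_t len = 0;                 /* plays the role of yyleng */
    size_t pos = 0;                 /* position in the input */

    if (s[0] != '"')
        return 0;
    for (;;) {
        /* One match of \"[^"]* : a quote, then all non-quote characters. */
        do {
            if (len + 1 >= outsz)
                return 0;
            out[len++] = s[pos++];
        } while (s[pos] != '"' && s[pos] != '\0');

        if (s[pos] == '\0')
            return 0;               /* unterminated string */
        if (out[len - 1] != '\\')
            break;                  /* real terminator: normal processing */
        /* Otherwise the quote was escaped: keep the text gathered so far
         * and match again, as yymore() would arrange. */
    }
    out[len] = '\0';
    return pos + 1;                 /* also consume the closing quote */
}
```

Like the rule it imitates, the sketch treats any backslash immediately before a quote as an escape, so a string whose last character is itself an escaped backslash is not handled.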

The function yyless() might be used to reprocess text in various 
circumstances. Consider the C language problem of 
distinguishing the ambiguity of "=-a". Suppose it is desired to 
treat this as "=- a" but also to print a message: a rule might be 

=-[a-zA-Z]    { 
              printf("Operator (=-) ambiguous\n"); 
              yyless(yyleng-1); 
              ... action for =- ... 
              } 

which prints a message, returns the letter after the operator to 
the input stream, and treats the operator as "=-". 
Alternatively, it might be desired to treat this as "= -a". To do 
this, just return the minus sign as well as the letter to the 
input: 

=-[a-zA-Z]    { 
              printf("Operator (=-) ambiguous\n"); 
              yyless(yyleng-2); 
              ... action for = ... 
              } 

performs the other interpretation. Note that the expressions 
for the two cases might more easily be written 

=-/[A-Za-z] 

in the first case, and 

=/-[A-Za-z] 

in the second; no backup is required in the rule action. It is not 
necessary to recognize the whole identifier to observe the 
ambiguity. The possibility of "=-3", however, makes 

=-/[^ \t\n] 

a still better rule. 
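
The relationship between yyless() and trailing context can be seen in a short C sketch. The function name match_minus_assign is hypothetical; the local variable yyleng only mimics the lex variable of the same name. The code matches three characters as the rule =-[a-zA-Z] would, then gives the letter back, which leaves the same two-character token that the rule =-/[A-Za-z] would produce directly.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the number of characters kept as the token when s begins with
 * "=-" followed by a letter, or 0 if it does not.  Hypothetical helper;
 * assumes the "C" locale so that isalpha means [A-Za-z]. */
size_t match_minus_assign(const char *s)
{
    size_t yyleng;

    if (s[0] != '=' || s[1] != '-' || !isalpha((unsigned char)s[2]))
        return 0;

    yyleng = 3;                     /* =-[a-zA-Z] matched three characters */
    yyleng -= 1;                    /* yyless(yyleng-1): return the letter */
    return yyleng;                  /* scanning resumes at the letter */
}
```

For "=-a" the result is 2, so the letter is still available to the next rule; for "=-3" the result is 0 because the character after the operator is not a letter.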

In addition to these routines, lex also permits access to the I/O 
routines it uses. They are as follows: 

1. input() returns the next input character. 

2. output(c) writes the character "c" on the output. 

3. unput(c) pushes the character "c" back onto the input 
stream to be read later by input(). 

By default, these routines are provided as macro definitions; 
but the user can override them and supply private versions. 
These routines define the relationship between external files 
and internal characters and must all be retained or modified 
consistently. They may be redefined to cause input or output to 
be transmitted to or from strange places including other 
programs or internal memory. The character set used must be 
consistent in all routines and a value of zero returned by input 
must mean end of file. The relationship between unput and 
input must be retained or the lex look ahead will not work. 
The lex program does not look ahead at all if it does not have 
to, but every rule ending in +, *, ?, or $ or containing / implies 
look ahead. Look ahead is also necessary to match an 
expression that is a prefix of another expression. The standard 
lex library imposes a 100-character limit on backup. 
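
The contract between input and unput described above can be sketched as a pair of private routines reading from a string in memory. The names my_input, my_unput, and set_source are hypothetical (chosen to avoid clashing with the lex macros), and the 100-character pushback stack merely echoes the backup limit mentioned in the text; this is not the library's own code.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100            /* mirrors the backup limit in the text */

static char pushback[PUSHBACK_MAX]; /* characters returned by my_unput */
static int npushed = 0;
static const char *src = "";        /* unread part of the in-memory input */

/* Select a new in-memory input source and discard any pushback. */
void set_source(const char *text)
{
    src = text;
    npushed = 0;
}

/* Return the next character, or 0 at end of input (the convention lex
 * relies on).  Pushed-back characters are returned first, newest first. */
int my_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Push c back onto the input; return 0 on success, -1 if the stack is full. */
int my_unput(int c)
{
    if (npushed >= PUSHBACK_MAX)
        return -1;
    pushback[npushed++] = (char)c;
    return 0;
}
```

A character pushed back with my_unput is the next one my_input returns, which is the property the lex look ahead depends on.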

Another lex library routine that you may sometimes want to 
redefine is yywrap(), which is called whenever lex reaches an 
end of file. If yywrap returns a 1, lex continues with the 
normal wrap up on end of input. Sometimes, however, it is 
convenient to arrange for more input to arrive from a new 
source. In this case, the user should provide a yywrap which 
arranges for new input and returns 0. This instructs lex to 
continue processing. The default yywrap always returns 1. 

This routine is also a convenient place to print tables, 
summaries, etc., at the end of a program. Note that it is not 
possible to write a normal rule that recognizes end of file; the 
only access to this condition is through yywrap. In fact, unless 
a private version of input() is supplied, a file containing nulls 
cannot be handled since a value of 0 returned by input is taken 
to be end of file. 



AMBIGUOUS SOURCE RULES 

The lex program can handle ambiguous specifications. When 
more than one expression can match the current input, lex 
chooses as follows: 



1. The longest match is preferred. 

2. Among rules that matched the same number of 
characters, the rule given first is preferred. 

Thus, suppose the rules 

integer keyword action ...; 
[a-z]+ identifier action ...; 

are to be given in that order. If the input is "integers", it is 
taken as an identifier because 

"[a-z]+" 

matches eight characters while "integer" matches only seven. 
If the input is "integer", both rules match seven characters; and 
the keyword rule is selected because it was given first. 
Anything shorter (e.g., "int") does not match the expression 
"integer" and so the identifier interpretation is used. 

The principle of preferring the longest match makes rules 
containing expressions like .* dangerous. For example: 

'.*' 

might appear to be a good way of recognizing a string in single 
quotes. However, it is an invitation for the program to read far 
ahead looking for a distant single quote. Presented with the 
input 

'first' quoted string here, 'second' here 

the above expression will match 

'first' quoted string here, 'second' 

which is probably not what was wanted. A better rule is of the 
form 

'[^'\n]*' 

which, on the above input, stops after 'first'. The 
consequences of errors like this are mitigated by the fact that 
the dot (.) operator does not match newline. Thus expressions 
like .* stop on the current line. Do not try to defeat this with 
expressions like [.\n]+ or equivalents; the lex generated 
program tries to read the entire input file, causing internal 
buffer overflows. 
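
The difference between the two quoted-string rules can be made concrete with two small C functions, each returning the length of the longest match at the start of its argument. Both names are hypothetical and the code is a hand-written sketch of the longest-match principle, not what lex generates.

```c
#include <stddef.h>

/* Longest match of  '.*'  at s: runs to the LAST quote on the line,
 * because dot matches everything except newline.  Returns 0 if none. */
size_t match_dot_star(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i;               /* remember the furthest closing quote */
    return last ? last + 1 : 0;
}

/* Longest match of  '[^'\n]*'  at s: stops at the NEXT quote. */
size_t match_no_quote(const char *s)
{
    size_t i;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n' && s[i] != '\''; i++)
        ;                           /* skip characters allowed by the class */
    return s[i] == '\'' ? i + 1 : 0;
}
```

On the input line used in the text, the first function spans from the opening quote of 'first' through the closing quote of 'second', while the second function stops after the seven characters of 'first'.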



Note that lex is normally partitioning the input stream, not 
searching for all possible matches of each expression. This 
means that each character is accounted for once and only once. 
For example, suppose it is desired to count occurrences of both 
"she" and "he" in an input text. Some lex rules to do this 
might be 

she    s++; 
he     h++; 
\n     | 
.      ; 



where the last two rules ignore everything besides "he" and 
"she". Remember that dot (.) does not include newline. Since 
"she" includes "he", lex normally does not recognize the 
instances of "he" included in "she" since once it has passed a 
"she" those characters are gone. 

Sometimes the user desires to override this choice. The action 
REJECT means "go do the next alternative". It causes 
whatever rule was second choice after the current rule to be 
executed. The position of the input pointer is adjusted 
accordingly. Suppose you really want to count the included 
instances of "he". Use the following rule to change the 
previous example to accomplish the task. 

she    {s++; REJECT;} 
he     {h++; REJECT;} 
\n     | 
.      ; 



After counting each expression, it is rejected; whenever 
appropriate, the other expression is then counted. In this 
example, you could note that "she" includes "he" but not vice 
versa and omit the REJECT action on "he". In other cases, it 
is not possible to state which input characters are in both 
classes. 
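
The difference between partitioning the input and using REJECT can be imitated in C. In this sketch count_she_he is a hypothetical function; with use_reject set to 0 it consumes each match whole, and with use_reject set to 1 it advances one character at a time, which has the same counting effect as rejecting every match and falling through to the single-character rule.

```c
#include <string.h>

/* Count occurrences of "she" and "he" in text.  Hypothetical helper.
 * use_reject == 0: partition the input, as the rules without REJECT do.
 * use_reject != 0: also count instances contained in other matches. */
void count_she_he(const char *text, int use_reject, int *s, int *h)
{
    const char *p;
    size_t step;

    *s = 0;
    *h = 0;
    for (p = text; *p != '\0'; p += step) {
        step = 1;                   /* default rule: pass one character */
        if (strncmp(p, "she", 3) == 0) {
            (*s)++;
            if (!use_reject)
                step = 3;           /* the characters of "she" are gone */
        } else if (strncmp(p, "he", 2) == 0) {
            (*h)++;
            if (!use_reject)
                step = 2;
        }
    }
}
```

For the text "she sells the shells", partitioning finds "she" twice and "he" once, while the REJECT behavior finds "she" twice and "he" three times.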

Consider the two rules 

a[bc]+    { ... ; REJECT;} 
a[cd]+    { ... ; REJECT;} 

If the input is "ab", only the first rule matches, and on "ad" 
only the second matches. The input string "accb" matches the 
first rule for four characters and then the second rule for three 
characters. In contrast, the input "accd" agrees with the 
second rule for four characters and then the first rule for three. 

In general, REJECT is useful whenever the purpose of lex is 
not to partition the input stream but to detect all examples of 
some items in the input, and the instances of these items may 
overlap or include each other. Suppose a digram table of the 
input is desired; normally, the digrams overlap, that is the 
word "the" is considered to contain both "th" and "he". 
Assuming a 2-dimensional array named digram to be 
incremented, the appropriate source is 

%% 
[a-z][a-z]    {digram[yytext[0]][yytext[1]]++; REJECT;} 
.             ; 
\n            ; 

where the REJECT is necessary to pick up a letter pair 
beginning at every character rather than at every other 
character. 
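
The overlapping digram count can also be written directly in C, which shows what the REJECT achieves: a letter pair is examined at every character position. The function count_digrams is hypothetical, and unlike the lex source it indexes a 26 by 26 table by letter offset, which assumes the lowercase letters are contiguous as in ASCII.

```c
#include <ctype.h>
#include <stddef.h>

/* Increment digram[x][y] for every adjacent pair of lowercase letters
 * in text, with overlapping pairs counted.  Hypothetical helper that
 * assumes 'a'..'z' are contiguous character codes (true in ASCII). */
void count_digrams(const char *text, int digram[26][26])
{
    size_t i;

    for (i = 0; text[i] != '\0' && text[i + 1] != '\0'; i++) {
        unsigned char a = (unsigned char)text[i];
        unsigned char b = (unsigned char)text[i + 1];

        if (islower(a) && islower(b))
            digram[a - 'a'][b - 'a']++;
        /* i advances by one, not two, so "the" yields "th" and "he". */
    }
}
```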

The action REJECT does not rescan the input; instead it 
remembers the results of the previous scan. This means that if 
a rule with trailing context is found and REJECT executed the 
user must not have used unput to change the characters 
forthcoming from the input stream. This is the only restriction 
on the user's ability to manipulate the not-yet-processed input. 



LEX SOURCE DEFINITIONS 

Recall the format of the lex source: 

{definitions} 

%% 

{rules} 

%% 

{user routines} 

So far, only the rules have been described. You need additional 
options to define variables for use in the program and for use 
by lex. Variables can go either in the definitions section or in 
the rules section. 

Remember lex is generating the rules into a program. Any 
source not intercepted by lex is copied into the generated 
program. There are three classes of such things. 

1. Any line not part of a lex rule or action that begins with 
a blank or tab is copied into the lex generated program. 
Such source input prior to the first %% delimiter is 
external to any function in the code; if it appears 
immediately after the first %%, it appears in an 
appropriate place for declarations in the function written 
by lex which contains the actions. This material must 
look like program fragments and should precede the first 
lex rule. 

Lines that begin with a blank or tab and that contain a 
comment are passed through to the generated program. 
This can be used to include comments in either the lex 
source or the generated code; the comments should follow 
the host language convention. 

2. Anything included between lines containing only %{ and 
%} is copied out as above. The delimiters are discarded. 
This format permits entering text like preprocessor 
statements that must begin in column 1 or copying lines 
that do not look like programs. 

3. Anything after the third %% delimiter, regardless of 
formats, etc., is copied out after the lex output. 

Definitions intended for lex are given before the first %% 
delimiter. Any line in this section not contained between %{ 
and %} and beginning in column 1 is assumed to define lex 
substitution strings. The format of such lines is: 

name translation 



This format causes the string given as a translation to be 
associated with the name. The name and translation must be 
separated by at least one blank or tab, and the name must 
begin with a letter. The translation can then be called out by 
the {name} syntax in a rule. Using {D} for the digits and {E} 
for an exponent field, for example, might abbreviate rules to 
recognize numbers: 

D    [0-9] 
E    [DEde][-+]?{D}+ 
%% 
{D}+                 printf("integer"); 
{D}+"."{D}*({E})?    | 
{D}*"."{D}+({E})?    | 
{D}+{E}              printf("real"); 

Note the first two rules for real numbers; both require a 
decimal point and contain an optional exponent field. The first 
requires at least one digit before the decimal point, and the 
second requires at least one digit after the decimal point. To 
correctly handle the problem posed by a Fortran expression 
such as "35.EQ.I", which does not contain a real number, a 
context-sensitive rule such as: 

[0-9]+/"."EQ    printf("integer"); 

could be used in addition to the normal rule for integers. 
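
To see which strings the number rules accept, the definitions {D} and {E} can be transcribed into C by hand. The sketch below classifies a complete string rather than the longest prefix of an input stream, so it is only an approximation of what the lex rules do; all names in it (digits, exponent, classify, and the enumeration) are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum num_kind { NOT_NUMBER, INTEGER, REAL };

/* Length of {D}* at s. */
static size_t digits(const char *s)
{
    size_t n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of {E}, that is [DEde][-+]?{D}+, at s; 0 if absent. */
static size_t exponent(const char *s)
{
    size_t i, d;

    if (s[0] == '\0' || strchr("DEde", s[0]) == NULL)
        return 0;
    i = 1;
    if (s[i] == '+' || s[i] == '-')
        i++;
    d = digits(s + i);
    return d > 0 ? i + d : 0;       /* at least one digit is required */
}

/* Classify the whole string s according to the four number rules. */
enum num_kind classify(const char *s)
{
    size_t before = digits(s);
    size_t after = 0, i = before, e;
    int has_dot = 0;

    if (s[i] == '.') {
        has_dot = 1;
        i++;
        after = digits(s + i);
        i += after;
    }
    e = exponent(s + i);
    i += e;
    if (s[i] != '\0')
        return NOT_NUMBER;          /* trailing characters remain */
    if (has_dot)                    /* a digit before or after the point */
        return (before + after > 0) ? REAL : NOT_NUMBER;
    if (before == 0)
        return NOT_NUMBER;
    return e > 0 ? REAL : INTEGER;  /* {D}+{E} is real, {D}+ is integer */
}
```

Under these rules "35" is an integer, while "3.14", ".5e-3", "12.", and "7D+10" are real; "1e" and "." are not numbers at all.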

The definitions section may also contain other commands 
including the selection of a host language, a character set table, 
a list of start conditions, or adjustments to the default size of 
arrays within lex itself for larger source programs. These 
possibilities are discussed later. 

USAGE 

There are two steps in compiling a lex source program. First, 
the lex source must be turned into a generated program in the 
host general purpose language. Then this program must be 
compiled and loaded usually with a library of lex subroutines. 
The generated program is on a file named lex.yy.c. The I/O 
library is defined in terms of the C language standard library. 

On the UNIX operating system, the library is accessed by the 
loader flag -ll. So an appropriate set of commands is 

lex source 
cc lex.yy.c -ll 

The resulting program is placed on the usual file a.out for later 
execution. To use lex with yacc, see "LEX AND YACC" 
below. Although the default lex I/O routines use the C 
language standard library, the lex automata themselves do not 
do so; if private versions of input, output, and unput are given, 
the library is avoided. 



LEX AND YACC 

To use lex with yacc, observe that lex writes a program 
named yylex() (the name required by yacc for its analyzer). 
Normally, the default main program on the lex library calls 
this routine; but if yacc is loaded and its main program is 
used, yacc calls yylex(). In this case, each lex rule ends with 

return(token); 

where the appropriate token value is returned. An easy way to 
get access to yacc's names for tokens is to compile the lex 
output file as part of the yacc output file by placing the line 

# include "lex.yy.c" 

in the last section of yacc input. If the grammar is to be 
named "good" and the lexical rules are to be named "better", 
the UNIX software command sequence could be 

yacc good 
lex better 
cc y.tab.c -ly -ll 

The yacc library (-ly) should be loaded before the lex library 
to obtain a main program that invokes the yacc parser. The 
generations of lex and yacc programs can be done in either 
order. 



EXAMPLES 

As a problem, consider copying an input file while adding three 
to every positive number divisible by seven. A suitable lex 
source program follows: 

%% 
        int k; 
[0-9]+  { 
        k = atoi(yytext); 
        if (k%7 == 0) 
                printf("%d", k+3); 
        else 
                printf("%d", k); 
        } 

The rule "[0-9]+" recognizes strings of digits; atoi() converts 
the digits to binary and stores the result in "k". The operator 
% (remainder) is used to check whether "k" is divisible by 
seven; if it is, "k" is incremented by three as it is written out. 
It may be objected that this program alters such input items as 
"49.63" or "X7". Furthermore, it increments the absolute value 
of all negative numbers divisible by seven. To avoid this, add a 
few more rules after the active one, as here: 

%% 
                        int k; 
-?[0-9]+                { 
                        k = atoi(yytext); 
                        printf("%d", k%7 == 0 ? k+3 : k); 
                        } 
-?[0-9.]+               ECHO; 
[A-Za-z][A-Za-z0-9]+    ECHO; 

Numerical strings containing a dot (.) or preceded by a letter 
will be picked up by one of the last two rules and not changed. 
The "if-else" has been replaced by a C language conditional 
expression to save space; the form "a?b:c" means "if a then b 
else c". 
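
The way the three rules interact (longest match first, earlier rule on a tie, default copy otherwise) can be traced with a hand-written C equivalent. This is a sketch under assumptions: add3_to_sevens, num_len, and word_len are hypothetical names, the code works on strings in memory rather than on standard input, and it uses snprintf from the C99 library, which is newer than the compilers this guide describes.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length of -?[0-9]+ (dots == 0) or -?[0-9.]+ (dots != 0) at s, else 0. */
static size_t num_len(const char *s, int dots)
{
    size_t i = (s[0] == '-') ? 1 : 0;
    size_t start = i;

    while (isdigit((unsigned char)s[i]) || (dots && s[i] == '.'))
        i++;
    return i > start ? i : 0;
}

/* Length of [A-Za-z][A-Za-z0-9]+ at s, else 0. */
static size_t word_len(const char *s)
{
    size_t i;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    i = 1;
    while (isalnum((unsigned char)s[i]))
        i++;
    return i >= 2 ? i : 0;
}

/* Copy in to out, adding three to every integer divisible by seven. */
void add3_to_sevens(const char *in, char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return;
    out[0] = '\0';
    while (*in != '\0' && used + 1 < outsz) {
        size_t n1 = num_len(in, 0);   /* rule 1: -?[0-9]+   */
        size_t n2 = num_len(in, 1);   /* rule 2: -?[0-9.]+  */
        size_t n3 = word_len(in);     /* rule 3: identifier */

        if (n1 > 0 && n1 >= n2 && n1 >= n3) {
            /* Rule 1 wins only when no later rule matches more text. */
            long k = atol(in);

            snprintf(out + used, outsz - used, "%ld",
                     k % 7 == 0 ? k + 3 : k);
            in += n1;
        } else {
            /* ECHO for rules 2 and 3, or the default one-character copy. */
            size_t n = n2 > n3 ? n2 : n3;

            if (n == 0)
                n = 1;
            snprintf(out + used, outsz - used, "%.*s", (int)n, in);
            in += n;
        }
        used = strlen(out);           /* snprintf always terminates out */
    }
}
```

With the input "14 49.63 X7 -21 8" the sketch leaves "49.63" and "X7" alone because a later rule matches more characters than the integer rule does.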



For an example of statistics gathering, here is a program that 
histograms the lengths of words, where a word is defined as a 
string of letters: 

        int lengs[100]; 
%% 
[a-z]+  lengs[yyleng]++; 
.       | 
\n      ; 
%% 
yywrap() 
{ 
        int i; 

        printf("Length  No. words\n"); 
        for (i = 0; i < 100; i++) 
                if (lengs[i] > 0) 
                        printf("%5d%10d\n", i, lengs[i]); 
        return(1); 
} 

This program accumulates the histogram while producing no 
output. At the end of the input, it prints the table. The final 
statement "return(1);" indicates that lex is to perform wrap 
up. If yywrap returns zero (false), it implies that further input 
is available and the program is to continue reading and 
processing. Providing a yywrap (that never returns true) 
causes an infinite loop. 
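
The accumulation step of the histogram program can be followed in a plain C version that works on a string in memory. The function word_length_histogram is hypothetical, and the local yyleng only mimics the lex variable. One deliberate difference from the lex source is the bounds check, which the original omits: a word of 100 or more letters would index past the end of lengs.

```c
#include <ctype.h>
#include <stddef.h>

#define MAXLEN 100                  /* same table size as the lex example */

/* Add the lengths of the lowercase words in text to lengs[].
 * Hypothetical helper; the caller zeroes lengs beforehand. */
void word_length_histogram(const char *text, int lengs[MAXLEN])
{
    size_t i = 0;

    while (text[i] != '\0') {
        size_t yyleng = 0;          /* length of the match of [a-z]+ */

        while (islower((unsigned char)text[i + yyleng]))
            yyleng++;
        if (yyleng > 0) {
            if (yyleng < MAXLEN)    /* guard not present in the original */
                lengs[yyleng]++;
            i += yyleng;
        } else {
            i++;                    /* any other character is ignored */
        }
    }
}
```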



LEFT CONTEXT SENSITIVITY 

Sometimes it is desirable to have several sets of lexical rules to 
be applied at different times in the input. For example, a 
compiler preprocessor might distinguish preprocessor 
statements and analyze them differently from ordinary 
statements. This requires sensitivity to prior context, and 
there are several ways of handling such problems. The ^ 
operator, for example, is a prior context operator recognizing 
immediately preceding left context just as $ recognizes 
immediately following right context. Adjacent left context 
could be extended to produce a facility similar to that for 
adjacent right context, but it is unlikely to be as useful since 
often the relevant left context appeared some time earlier such 
as at the beginning of a line. 

This part describes three means of dealing with different 
environments: a simple use of flags (when only a few rules 
change from one environment to another), the use of "start 
conditions" on rules, and the possibility of making multiple 
lexical analyzers all run together. In each case, there are rules 
that recognize the need to change the environment in which the 
following input text is analyzed and that set a parameter to 
reflect the change. This may be a flag explicitly tested by the 
user's action code; this is the simplest way of dealing with the 
problem since lex is not involved at all. It may be more 
convenient, however, to have lex remember the flags as initial 
conditions on the rules. Any rule may be associated with a 
start condition. It is only recognized when lex is in that start 
condition. The current start condition may be changed at any 
time. Finally, if the sets of rules for the different 
environments are very dissimilar, clarity may be best achieved 
by writing several distinct lexical analyzers and switching from 
one to another as desired. 

Consider the following problem: copy the input to the output, 
changing the word "magic" to "first" on every line which 
began with the letter "a", changing "magic" to "second" on 
every line which began with the letter "b", and changing 
"magic" to "third" on every line which began with the letter 
"c". All other words and all other lines are left unchanged. 

These rules are so simple that the easiest way to do this job is 
with a flag. 

        int flag; 
%% 
^a      {flag = 'a'; ECHO;} 
^b      {flag = 'b'; ECHO;} 
^c      {flag = 'c'; ECHO;} 
\n      {flag = 0; ECHO;} 
magic   { 
        switch (flag) 
        { 
        case 'a': printf("first"); break; 
        case 'b': printf("second"); break; 
        case 'c': printf("third"); break; 
        default: ECHO; break; 
        } 
        } 

should be adequate. 

To handle the same problem with start conditions, each start 
condition must be introduced to lex in the definitions section 
with a line reading 

%Start name1 name2 ... 

where the conditions may be named in any order. The word 
"Start" may be abbreviated to "s" or "S". The conditions may 
be referenced at the head of a rule with <> brackets: 

<name1>expression 

is a rule that is only recognized when lex is in the start 
condition name1. To enter a start condition, execute the 
action statement 

BEGIN name1; 

which changes the start condition to name1. To resume the 
normal state, 

BEGIN 0; 

resets the initial condition of the lex automaton interpreter. A 
rule may be active in several start conditions. 

<name1,name2,name3> 

is a legal prefix. Any rule not beginning with the <> prefix 
operator is always active. 

The same example as before can be written as follows: 

%START AA BB CC 
%% 
^a           {ECHO; BEGIN AA;} 
^b           {ECHO; BEGIN BB;} 
^c           {ECHO; BEGIN CC;} 
\n           {ECHO; BEGIN 0;} 
<AA>magic    printf("first"); 
<BB>magic    printf("second"); 
<CC>magic    printf("third"); 

where the logic is exactly the same as in the previous method of 
handling the problem, but lex does the work rather than the 
user's code. 
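
The bookkeeping that start conditions perform can be pictured as a small state variable, as in the following C sketch. It is a hand-written imitation working on strings in memory, not lex output; replace_magic and the enumeration names are hypothetical, and snprintf assumes a C99 library.

```c
#include <stdio.h>
#include <string.h>

enum start { INITIAL, AA, BB, CC }; /* hypothetical; mirrors the conditions */

/* Copy in to out, replacing "magic" according to the first letter of
 * the line on which it appears.  Hypothetical helper. */
void replace_magic(const char *in, char *out, size_t outsz)
{
    static const char *const word[] = { "magic", "first", "second", "third" };
    enum start state = INITIAL;
    int at_bol = 1;                 /* at the beginning of a line? */
    size_t used = 0;

    if (outsz == 0)
        return;
    out[0] = '\0';
    while (*in != '\0' && used + 1 < outsz) {
        if (at_bol && *in >= 'a' && *in <= 'c')
            state = (enum start)(AA + (*in - 'a'));  /* BEGIN AA, BB, CC */

        if (strncmp(in, "magic", 5) == 0) {
            /* Only the rule for the current start condition applies. */
            snprintf(out + used, outsz - used, "%s", word[state]);
            in += 5;
            at_bol = 0;
        } else {
            if (*in == '\n')
                state = INITIAL;    /* BEGIN 0 at the end of each line */
            at_bol = (*in == '\n');
            snprintf(out + used, outsz - used, "%c", *in);
            in++;
        }
        used = strlen(out);
    }
}
```

Lines that begin with any other letter stay in the initial state, so their occurrences of "magic" are copied unchanged.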



CHARACTER SET 

The programs generated by lex handle character I/O only 
through the routines input(), output(), and unput(). Thus, the 
character representation provided in these routines is accepted 
by lex and used to return values in yytext. For internal use, 
a character is represented as a small integer which, if the 
standard library is used, has a value equal to the integer value 
of the bit pattern representing the character on the host 
computer. Normally, the letter a is represented in the same 
form as the character constant 'a'. If this interpretation is 
changed by providing I/O routines that translate the 
characters, lex must be given a translation table that is in the 
definitions section and must be bracketed by lines containing 
only %T; the translation table contains lines of the form 

{integer} {character string} 

which indicate the value associated with each character. 



SUMMARY OF SOURCE FORMAT 

The general form of a lex source file is 

{definitions} 

%% 

{rules} 

%% 

{user subroutines} 

The definitions section contains a combination of 

1. Definitions in the form "name space translation". 

2. Included code in the form "space code". 

3. Included code in the form: 

%{ 
code 
%} 

4. Start conditions given in the form: 

%S name1 name2 ... 

5. Character set tables in the form: 

%T 

number space character-string 

%T 

6. Changes to internal array sizes in the form: 

%x nnn 

where "nnn" is a decimal integer representing an array size 
and "x" selects the parameter as follows: 

Letter    Parameter 

p         positions 
n         states 
e         tree nodes 
a         transitions 
k         packed character classes 
o         output array size 



Lines in the rules section have the form "expression action" 
where the action may be continued on succeeding lines by using 
braces to delimit it. 

Regular expressions in lex use the following operators: 

x         the character "x". 
"x"       an "x", even if x is an operator. 
\x        an "x", even if x is an operator. 
[xy]      the character x or y. 
[x-z]     the characters x, y, or z. 
[^x]      any character but x. 
.         any character but newline. 
^x        an x at the beginning of a line. 
<y>x      an x when lex is in start condition y. 
x$        an x at the end of a line. 
x?        an optional x. 
x*        0,1,2, ... instances of x. 
x+        1,2,3, ... instances of x. 
x|y       an x or a y. 
(x)       an x. 
x/y       an x but only if followed by y. 
{xx}      the translation of xx from the definitions section. 
x{m,n}    m through n occurrences of x. 



CAVEATS AND BUGS 

There are pathological expressions that produce exponential 
growth of the tables when converted to deterministic machines; 
fortunately, they are rare. 

REJECT does not rescan the input; instead it remembers the 
results of the previous scan. This means that if a rule with 
trailing context is found and REJECT executed, the user must 
not have used unput to change the characters forthcoming from 
the input stream. This is the only restriction on the user's 
ability to manipulate the not-yet-processed input. 



Chapter 22 

YET ANOTHER COMPILER-COMPILER "yacc" 



GENERAL 

The yacc program provides a general tool for imposing 
structure on the input to a computer program. The yacc user 
prepares a specification of the input process. This includes rules 
describing the input structure, code to be invoked when these 
rules are recognized, and a low-level routine to do the basic 
input. The yacc program then generates a function to control 
the input process. This function, called a parser, calls the 
user-supplied low-level input routine (the lexical analyzer) to 
pick up the basic items (called tokens) from the input stream. 
These tokens are organized according to the input structure 
rules, called grammar rules. When one of these rules has been 
recognized, then user code (supplied for this rule, an action) is 
invoked. Actions have the ability to return values and make use 
of the values of other actions. 

The yacc program is written in a portable dialect of the C 
language, and the actions and output subroutine are in the C 
language as well. Moreover, many of the syntactic conventions 
of yacc follow the C language. 

The heart of the input specification is a collection of grammar 
rules. Each rule describes an allowable structure and gives it a 
name. For example, one grammar rule might be 

date : month_name day ',' year ; 

where "date", "month_name", "day", and "year" represent 
structures of interest in the input process; presumably, 
"month_name", "day", and "year" are defined elsewhere. The comma 
is enclosed in single quotes. This implies that the comma is to 
appear literally in the input. The colon and semicolon merely 
serve as punctuation in the rule and have no significance in 
controlling the input. With proper definitions, the input 

July 4, 1776 

might be matched by the rule. 

An important part of the input process is carried out by the 
lexical analyzer. This user routine reads the input stream, 
recognizes the lower-level structures, and communicates these 
tokens to the parser. For historical reasons, a structure 
recognized by the lexical analyzer is called a "terminal symbol", 
while the structure recognized by the parser is called a 
"nonterminal symbol". To avoid confusion, terminal symbols 
will usually be referred to as "tokens". 

There is considerable leeway in deciding whether to recognize 
structures using the lexical analyzer or grammar rules. For 
example, the rules 



month_name : 'F' 'e' 'b' ;
    ...
month_name : 'D' 'e' 'c' ;

might be used in the above example. The lexical analyzer only
needs to recognize individual letters, and "month_name" is a
nonterminal symbol. Such low-level rules tend to waste time
and space and may complicate the specification beyond the
ability of yacc to deal with it. Usually, the lexical analyzer
recognizes the month names and returns an indication that a
"month_name" is seen. In this case, "month_name" is a "token".
Literal characters such as a comma must also be passed
through the lexical analyzer and are also considered tokens. 
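
As a rough illustration of recognizing month names in the lexical
analyzer rather than in grammar rules, the following C sketch maps a
word to a token number. The function name month_token, the token
names, and the numbers 257 and 258 are assumptions made for this
example only; a real lexical analyzer would use the definitions that
yacc generates.

```c
#include <string.h>

/* Hypothetical token numbers, chosen for this sketch only. */
#define MONTH_NAME 257
#define UNKNOWN    258

/* Return MONTH_NAME if word is a three-letter month abbreviation,
 * and UNKNOWN otherwise. */
int month_token(const char *word)
{
    static const char *months[] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    size_t i;

    for (i = 0; i < sizeof months / sizeof months[0]; i++)
        if (strcmp(word, months[i]) == 0)
            return MONTH_NAME;
    return UNKNOWN;
}
```

With this approach the grammar sees a single token for every month,
so no rules are spent on individual letters.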

Specification files are very flexible. It is relatively easy to add 
to the above example the rule 

date : month '/' day '/' year ;
allowing 

7 / 4 / 1776 
as a synonym for 

July 4, 1776 

on input. In most cases, this new rule could be "slipped in" to a 
working system with minimal effort and little danger of 
disrupting existing input. 

The input being read may not conform to the specifications. 
These input errors are detected as early as is theoretically 
possible with a left-to-right scan. Thus, not only is the chance 
of reading and computing with bad input data substantially 
reduced, but the bad data can usually be quickly found. Error 
handling, provided as part of the input specifications, permits 
the reentry of bad data or the continuation of the input process 
after skipping over the bad data. 

In some cases, yacc fails to produce a parser when given a set 
of specifications. For example, the specifications may be self- 
contradictory, or they may require a more powerful recognition 
mechanism than that available to yacc. The former cases 
represent design errors; the latter cases can often be corrected 
by making the lexical analyzer more powerful or by rewriting 
some of the grammar rules. While yacc cannot handle all
possible specifications, its power compares favorably with
similar systems. Moreover, the constructions which are 
difficult for yacc to handle are also frequently difficult for 
human beings to handle. Some users have reported that the 
discipline of formulating valid yacc specifications for their 
input revealed errors of conception or design early in the 
program development. 

The yacc program has been extensively used in numerous 
practical applications, including lint, the Portable C Compiler, 
and a system for typesetting mathematics. 

The remainder of this document describes the following 
subjects as they relate to yacc: 

• Basic process of preparing a yacc specification 

• Parser operation 

• Handling ambiguities 

• Handling operator precedences in arithmetic expressions 

• Error detection and recovery 

• The operating environment and special features of the 
parsers yacc produces 

• Suggestions to improve the style and efficiency of the 
specifications 

• Advanced topics. 

In addition, there are four appendices. Appendix 1 is a brief 
example, and Appendix 2 is a summary of the yacc input 
syntax. Appendix 3 gives an example using some of the more 
advanced features of yacc, and Appendix 4 describes 
mechanisms and syntax no longer actively supported but
provided for historical continuity with older versions of yacc.



BASIC SPECIFICATIONS 

Names refer to either tokens or nonterminal symbols. The 
yacc program requires token names to be declared as such. In 
addition, it is often desirable to include the lexical analyzer as 
part of the specification file. It may be useful to include other 
programs as well. Thus, every specification file consists of 
three sections: the declarations, (grammar) rules, and 
programs. The sections are separated by double percent (%%) 
marks. (The percent symbol is generally used in yacc 
specifications as an escape character.) 

In other words, a full specification file looks like 

declarations 

%% 

rules 

%% 

programs 

when each section is used. 



The declaration section may be empty, and if the programs 
section is omitted, the second %% mark may also be omitted. 
The smallest legal yacc specification is 

%% 
rules 

since the other two sections may be omitted. 



Blanks, tabs, and newlines are ignored, but they may not 
appear in names or multicharacter reserved symbols. 
Comments may appear wherever a name is legal. They are 
enclosed in /* ... */, as in C language. 

The rules section is made up of one or more grammar rules. A 
grammar rule has the form 

A : BODY ; 

where "A" represents a nonterminal name, and "BODY" 
represents a sequence of zero or more names and literals. The 
colon and the semicolon are yacc punctuation. 

Names may be of arbitrary length and may be made up of 
letters, dots, underscores, and noninitial digits. Uppercase and 
lowercase letters are distinct. The names used in the body of a 
grammar rule may represent tokens or nonterminal symbols. 

A literal consists of a character enclosed in single quotes ('). 
As in C language, the backslash (\) is an escape character 
within literals, and all the C language escapes are recognized. 
Thus: 



'\n'       newline
'\r'       return
'\''       single quote (')
'\\'       backslash (\)
'\t'       tab
'\b'       backspace
'\f'       form feed
'\xxx'     "xxx" in octal



are understood by yacc. For a number of technical reasons, 
the NUL character ('\0' or 0) should never be used in grammar 
rules.

If there are several grammar rules with the same left-hand
side, the vertical bar ( | ) can be used to avoid rewriting the 
left-hand side. In addition, the semicolon at the end of a rule 
can be dropped before a vertical bar. Thus the grammar rules: 



A : B C D ;
A : E F ;
A : G ;



can be given to yacc as: 

A : B C D
  | E F
  | G
  ;



by using the vertical bar. It is not necessary that all grammar 
rules with the same left side appear together in the grammar 
rules section although it makes the input much more readable 
and easier to change. 

If a nonterminal symbol matches the empty string, this can be 
indicated by: 

empty : ; 

which is understood by yacc. 

Names representing tokens must be declared. This is most 
simply done by writing: 

%token name1 name2 ...

in the declarations section. Every name not defined in the 
declarations section is assumed to represent a nonterminal
symbol. Every nonterminal symbol must appear on the left
side of at least one rule. 



Of all the nonterminal symbols, the start symbol has particular 
importance. The parser is designed to recognize the start 
symbol. Thus, this symbol represents the largest, most general 
structure described by the grammar rules. By default, the start 
symbol is taken to be the left-hand side of the first grammar 
rule in the rules section. It is possible and desirable to declare 
the start symbol explicitly in the declarations section using the 
%start keyword

%start symbol

to define the start symbol. 

The end of the input to the parser is signaled by a special 
token, called the end-marker. If the tokens up to but not 
including the end-marker form a structure that matches the 
start symbol, the parser function returns to its caller after the 
end-marker is seen and accepts the input. If the end-marker is 
seen in any other context, it is an error. 

It is the job of the user-supplied lexical analyzer to return the 
end-marker when appropriate. Usually the end-marker 
represents some reasonably obvious I/O status, such as "end of 
file" or "end of record". 



ACTIONS 

With each grammar rule, the user may associate actions to be 
performed each time the rule is recognized in the input process. 
These actions may return values and may obtain the values 
returned by previous actions. Moreover, the lexical analyzer 
can return values for tokens if desired.

An action is an arbitrary C language statement and as such can
do input and output, call subprograms, and alter external
vectors and variables. An action is specified by one or more
statements enclosed in curly braces ({ and }). For example:



A : '(' B ')'
        {
                hello( 1, "abc" );
        }



and 



XXX : YYY ZZZ
        {
                printf( "a message\n" );
                flag = 25;
        }

are grammar rules with actions.

To facilitate easy communication between the actions and the 
parser, the action statements are altered slightly. The dollar 
sign symbol ($) is used as a signal to yacc in this context. 

To return a value, the action normally sets the pseudo-variable 
$$ to some value. For example, the action 

{ $$ = 1; } 

does nothing but return the value of one. 

To obtain the values returned by previous actions and the 
lexical analyzer, the action may use the pseudo-variables $1, 
$2, ..., which refer to the values returned by the components of 
the right side of a rule, reading from left to right. If the rule is 




A : B C D ; 

then $2 has the value returned by C, and $3 the value returned
by D.

The rule 

expr : '(' expr ')' ; 

provides a more concrete example. The value returned by this 
rule is usually the value of the "expr" in parentheses. This can 
be indicated by 

expr : '(' expr ')'
        {
                $$ = $2 ;
        }

By default, the value of a rule is the value of the first element 
in it ($1). Thus, grammar rules of the form 

A : B ; 

frequently need not have an explicit action. 

In the examples above, all the actions came at the end of rules. 
Sometimes, it is desirable to get control before a rule is fully 
parsed. The yacc program permits an action to be written in the
middle of a rule as well as at the end. Such an action is assumed to
return a value accessible through the usual $ mechanism by the actions
to the right of it. In turn, it may access the values returned by
the symbols to its left. Thus, in the rule

A : B
        {
                $$ = 1;
        }
    C
        {
                x = $2;
                y = $3;
        }
  ;



the effect is to set x to 1 and y to the value returned by C.

Actions that do not terminate a rule are actually handled by 
yacc by manufacturing a new nonterminal symbol name and a 
new rule matching this name to the empty string. The interior 
action is the action triggered off by recognizing this added rule. 
The yacc program actually treats the above example as if it 
had been written 



$ACT : /* empty */
        {
                $$ = 1;
        }
     ;

A    : B $ACT C
        {
                x = $2;
                y = $3;
        }
     ;


where $ACT is the manufactured nonterminal symbol that matches the
empty string.

In many applications, output is not done directly by the actions. 
A data structure, such as a parse tree, is constructed in 
memory and transformations are applied to it before output is
generated. Parse trees are particularly easy to construct given
routines to build and maintain the tree structure desired. For 
example, suppose there is a C function node written so that the 
call 

node( L, n1, n2 )

creates a node with label L and descendants n1 and n2 and
returns the index of the newly created node. Then a parse tree
can be built by supplying actions such as

expr : expr '+' expr 

{ 

$$ = node( '+', $1, $3 ); 

} 
in the specification. 
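
One possible shape for such a node function is sketched below. The
fixed-size table, the structure layout, the use of -1 for "no
descendant", and the names tnode, nodes, nnodes, and MAXNODES are all
assumptions of this sketch; the text above specifies only that node
returns the index of the newly created node.

```c
#define MAXNODES 256   /* hypothetical table size */

/* One tree node: a label and the indices of two descendants. */
struct tnode {
    int label;
    int left;
    int right;
};

static struct tnode nodes[MAXNODES];
static int nnodes = 0;   /* number of nodes created so far */

/* Create a node with the given label and descendants.
 * Returns its index, or -1 if the table is full. */
int node(int label, int n1, int n2)
{
    if (nnodes >= MAXNODES)
        return -1;
    nodes[nnodes].label = label;
    nodes[nnodes].left = n1;
    nodes[nnodes].right = n2;
    return nnodes++;
}
```

Because every node is identified by an integer index, the values
passed between actions remain integers, as in the other examples here.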

The user may define other variables to be used by the actions. 
Declarations and definitions can appear in the declarations 
section enclosed in the marks %{ and %}. These declarations 
and definitions have global scope, so they are known to the 
action statements and the lexical analyzer. For example: 

%{ int variable = 0; %}

could be placed in the declarations section making "variable" 
accessible to all of the actions. The yacc parser uses only 
names beginning with yy. The user should avoid such names. 

In these examples, all the values are integers. A discussion of 
values of other types is found in the part "ADVANCED 
TOPICS".



LEXICAL ANALYSIS

The user must supply a lexical analyzer to read the input 
stream and communicate tokens (with values, if desired) to the 
parser. The lexical analyzer is an integer-valued function called 
yylex. The function returns an integer, the token number, 
representing the kind of token read. If there is a value 
associated with that token, it should be assigned to the external 
variable yylval. 

The parser and the lexical analyzer must agree on these token 
numbers in order for communication between them to take 
place. The numbers may be chosen by yacc or the user. In 
either case, the #define mechanism of C language is used to 
allow the lexical analyzer to return these numbers symbolically. 
For example, suppose that the token name DIGIT has been 
defined in the declarations section of the yacc specification 
file. The relevant portion of the lexical analyzer might look 
like: 

yylex()
{
        extern int yylval;
        int c;

        c = getchar();

        switch( c )
        {
        case '0':
        case '1':
          ...
        case '9':
                yylval = c - '0';
                return( DIGIT );
          ...
        }
          ...

to return the appropriate token.

The intent is to return a token number of DIGIT and a value 
equal to the numerical value of the digit. Provided that the 
lexical analyzer code is placed in the programs section of the 
specification file, the identifier DIGIT is defined as the token 
number associated with the token DIGIT. 

This mechanism leads to clear, easily modified lexical 
analyzers. The only pitfall to avoid is using any token names in 
the grammar that are reserved or significant in C language or 
the parser. For example, the use of token names if or while 
will almost certainly cause severe difficulties when the lexical 
analyzer is compiled. The token name error is reserved for 
error handling and should not be used naively. 

As mentioned above, the token numbers may be chosen by yacc 
or the user. In the default situation, the numbers are chosen 
by yacc. The default token number for a literal character is 
the numerical value of the character in the local character set. 
Other names are assigned token numbers starting at 257. 

To assign a token number to a token (including literals), the 
first appearance of the token name or literal in the declarations 
section can be immediately followed by a nonnegative integer. 
This integer is taken to be the token number of the name or 
literal. Names and literals not defined by this mechanism 
retain their default definition. It is important that all token 
numbers be distinct. 
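
The default numbering can be illustrated with a small C sketch. It
reads from a string instead of standard input so that it can be tried
in isolation. The names next_token, input, and tokval are assumptions
of this sketch (tokval stands in for yylval), and DIGIT is defined by
hand as 257 here, whereas yacc would normally supply that definition.
The sketch returns 0 at the end of the input, the value reserved for
the end-marker.

```c
#include <ctype.h>

#define DIGIT 257            /* first default number for a named token */

static const char *input;    /* hypothetical input buffer, set by the caller */
static int tokval;           /* stands in for yylval in this sketch */

/* Return the next token number, or 0 at the end of the input. */
int next_token(void)
{
    int c;

    while (*input == ' ' || *input == '\t')
        input++;                      /* skip blanks and tabs */
    if (*input == '\0')
        return 0;                     /* end-marker */
    c = (unsigned char)*input++;
    if (isdigit(c)) {
        tokval = c - '0';             /* numerical value of the digit */
        return DIGIT;
    }
    return c;                         /* literal: its character code */
}
```

A literal such as '/' is returned as its own character code, which is
always below 257 and so cannot collide with a named token.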

For historical reasons, the end-marker must have token number
0 or negative. This token number cannot be redefined by the
user. Thus, all lexical analyzers should be prepared to return
0 or a negative number as a token upon reaching the end of their
input.

A very useful tool for constructing lexical analyzers is the lex
program. These lexical analyzers are designed to work in close 
harmony with yacc parsers. The specifications for these 
lexical analyzers use regular expressions instead of grammar 
rules. Lex can be easily used to produce quite complicated 
lexical analyzers, but there remain some languages (such as 
FORTRAN) which do not fit any theoretical framework and 
whose lexical analyzers must be crafted by hand. 



PARSER OPERATION 

The yacc program turns the specification file into a C language 
program, which parses the input according to the specification 
given. The algorithm used to go from the specification to the 
parser is complex and will not be discussed here. The parser 
itself, however, is relatively simple and understanding how it 
works will make treatment of error recovery and ambiguities 
much more comprehensible. 

The parser produced by yacc consists of a finite state machine 
with a stack. The parser is also capable of reading and 
remembering the next input token (called the look-ahead 
token). The current state is always the one on the top of the 
stack. The states of the finite state machine are given small 
integer labels. Initially, the machine is in state 0 (the stack
contains only state 0) and no look-ahead token has been read.

The machine has only four actions available: shift, reduce,
accept, and error. A step of the parser is done as follows: 

1. Based on its current state, the parser decides if it needs a 
look-ahead token to choose the action to be taken. If it 
needs one and does not have one, it calls yylex to obtain 
the next token.

2. Using the current state and the look-ahead token if
needed, the parser decides on its next action and carries 
it out. This may result in states being pushed onto the 
stack or popped off of the stack and in the look-ahead 
token being processed or left alone. 

The shift action is the most common action the parser takes. 
Whenever a shift action is taken, there is always a look-ahead 
token. For example, in state 56 there may be an action 

IF shift 34 

which says, in state 56, if the look-ahead token is IF, the 
current state (56) is pushed down on the stack, and state 34 
becomes the current state (on the top of the stack). The look- 
ahead token is cleared. 

The reduce action keeps the stack from growing without 
bounds. Reduce actions are appropriate when the parser has 
seen the right-hand side of a grammar rule and is prepared to 
announce that it has seen an instance of the rule replacing the 
right-hand side by the left-hand side. It may be necessary to 
consult the look-ahead token to decide whether to reduce or not 
(usually it is not necessary). In fact, the default action 
(represented by a dot) is often a reduce action. 

Reduce actions are associated with individual grammar rules. 
Grammar rules are also given small integer numbers, and this 
leads to some confusion. The action 

. reduce 18 

refers to grammar rule 18, while the action 

IF shift 34

refers to state 34.

Suppose the rule

A : x y z ;

is being reduced. The reduce action depends on the left-hand 
symbol (A in this case) and the number of symbols on the 
right-hand side (three in this case). To reduce, first pop off the 
top three states from the stack. (In general, the number of 
states popped equals the number of symbols on the right side of 
the rule.) In effect, these states were the ones put on the stack 
while recognizing x, y, and z and no longer serve any useful 
purpose. After popping these states, a state is uncovered which 
was the state the parser was in before beginning to process the 
rule. Using this uncovered state and the symbol on the left side 
of the rule, perform what is in effect a shift of A. A new state 
is obtained, pushed onto the stack, and parsing continues. 
There are significant differences between the processing of the 
left-hand symbol and an ordinary shift of a token, however, so 
this action is called a goto action. In particular, the look-ahead 
token is cleared by a shift but is not affected by a goto. In any 
case, the uncovered state contains an entry such as 

A goto 20 

causing state 20 to be pushed onto the stack and become the 
current state. 

In effect, the reduce action "turns back the clock" in the parse,
popping the states off the stack to go back to the state where
the right-hand side of the rule was first seen. The parser then
behaves as if it had seen the left side at that time. If the
right-hand side of the rule is empty, no states are popped off of
the stack. The uncovered state is in fact the current state.

The reduce action is also important in the treatment of user-
supplied actions and values. When a rule is reduced, the code 
supplied with the rule is executed before the stack is adjusted. 
In addition to the stack holding the states, another stack 
running in parallel with it holds the values returned from the 
lexical analyzer and the actions. When a shift takes place, the 
external variable "yylval" is copied onto the value stack. After 
the return from the user code, the reduction is carried out. 
When the goto action is done, the external variable "yyval" is 
copied onto the value stack. The pseudo-variables $1, $2, etc., 
refer to the value stack. 
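
The interplay of the two stacks can be sketched in C as follows. This
is not the code yacc generates; it is a simplified model under stated
assumptions: the names push and reduce_minus, the stack size, and the
state numbers used when calling them are hypothetical, overflow is not
checked, and the rule being reduced is assumed to have three symbols
on its right side with the action $$ = $1 - $3.

```c
#define STACKMAX 64   /* hypothetical stack size; no overflow check */

static int states[STACKMAX];   /* state stack */
static int values[STACKMAX];   /* value stack, kept parallel to states */
static int top = -1;           /* index of the top entry of both stacks */

/* Shift: push the new state together with the token value
 * (the role played by yylval). */
void push(int state, int value)
{
    top++;
    states[top] = state;
    values[top] = value;
}

/* Reduce by a three-symbol rule whose action is $$ = $1 - $3.
 * The action runs first, then three entries are popped, and the
 * goto pushes next_state with the result (the role of yyval). */
void reduce_minus(int next_state)
{
    int result = values[top - 2] - values[top];   /* $1 - $3 */

    top -= 3;
    push(next_state, result);
}
```

Note that $1 and $3 are read from the value stack before anything is
popped, which matches the order of events described above.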

The other two parser actions are conceptually much simpler. 
The accept action indicates that the entire input has been seen 
and that it matches the specification. This action appears only 
when the look-ahead token is the end-marker and indicates that 
the parser has successfully done its job. The error action, on 
the other hand, represents a place where the parser can no 
longer continue parsing according to the specification. The 
input tokens it has seen (together with the look-ahead token) 
cannot be followed by anything that would result in a legal 
input. The parser reports an error and attempts to recover the 
situation and resume parsing. The error recovery (as opposed to 
the detection of error) will be discussed later. 

Consider: 



%token DING DONG DELL
%%
rhyme : sound place
      ;
sound : DING DONG
      ;
place : DELL
      ;



as a yacc specification.

When yacc is invoked with the -v option, a file called y.output
is produced with a human-readable description of the parser.
The y.output file corresponding to the above grammar (with
some statistics stripped off the end) is:



state 0
        $accept : _rhyme $end

        DING shift 3
        . error

        rhyme goto 1
        sound goto 2

state 1
        $accept : rhyme_$end

        $end accept
        . error

state 2
        rhyme : sound_place

        DELL shift 5
        . error

        place goto 4

state 3
        sound : DING_DONG

        DONG shift 6
        . error

state 4
        rhyme : sound place_ (1)

        . reduce 1

state 5
        place : DELL_ (3)

        . reduce 3

state 6
        sound : DING DONG_ (2)

        . reduce 2

where the actions for each state are specified and there is a 
description of the parsing rules being processed in each state. 
The _ character is used to indicate what has been seen and 
what is yet to come in each rule. The following input 

DING DONG DELL 

can be used to track the operations of the parser. Initially, the 
current state is state 0. The parser needs to refer to the input 
in order to decide between the actions available in state 0, so 
the first token, DING, is read and becomes the look-ahead 
token. The action in state 0 on DING is shift 3, state 3 is
pushed onto the stack, and the look-ahead token is cleared. 
State 3 becomes the current state. The next token, DONG, is 
read and becomes the look-ahead token. The action in state 3 
on the token DONG is shift 6, state 6 is pushed onto the stack, 
and the look-ahead is cleared. The stack now contains 0, 3, and 
6. In state 6, without even consulting the look-ahead, the 
parser reduces by 

sound : DING DONG 

which is rule 2. Two states, 6 and 3, are popped off of the stack,
uncovering state 0. Consulting the description of state 0
(looking for a goto on sound),

sound goto 2

is obtained. State 2 is pushed onto the stack and becomes the
current state.



In state 2, the next token, DELL, must be read. The action is 
shift 5, so state 5 is pushed onto the stack, which now has 0, 2, 
and 5 on it, and the look-ahead token is cleared. In state 5, the 
only action is to reduce by rule 3. This has one symbol on the 
right-hand side, so one state, 5, is popped off, and state 2 is 
uncovered. The goto in state 2 on place (the left side of rule 3) 
is state 4. Now, the stack contains 0, 2, and 4. In state 4, the 
only action is to reduce by rule 1. There are two symbols on
the right, so the top two states are popped off, uncovering state
0 again. In state 0, there is a goto on rhyme causing the parser
to enter state 1. In state 1, the input is read and the
end-marker is obtained, indicated by $end in the y.output file. The
action in state 1 (when the end-marker is seen) successfully
ends the parse.

The reader is urged to consider how the parser works when 
confronted with such incorrect strings as DING DONG DONG, 
DING DONG, DING DONG DELL DELL, etc. A few minutes 
spent with this and other simple examples is repaid when 
problems arise in more complicated contexts. 
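
For readers who prefer to experiment, the seven states listed in the
y.output file can be coded by hand. The C sketch below is not the
parser that yacc generates; it is a simplified model written for this
example. The names rhyme_token, go_to, and parse_rhyme, the BAD token,
and the blank-separated input format are assumptions of the sketch.

```c
#include <string.h>

enum { END = 0, DING = 257, DONG, DELL, BAD };   /* token numbers */
enum { RHYME, SOUND, PLACE };                    /* nonterminal symbols */

static const char *src;   /* hypothetical input: blank-separated words */

/* Lexical analyzer: return the next token, or END at end of input. */
static int rhyme_token(void)
{
    static const struct { const char *word; int tok; } words[] = {
        { "DING", DING }, { "DONG", DONG }, { "DELL", DELL }
    };
    size_t i, n;

    while (*src == ' ')
        src++;
    if (*src == '\0')
        return END;
    n = strcspn(src, " ");
    for (i = 0; i < 3; i++)
        if (strlen(words[i].word) == n &&
            strncmp(src, words[i].word, n) == 0) {
            src += n;
            return words[i].tok;
        }
    src += n;
    return BAD;
}

/* Goto entries: the state entered after reducing to nt in state s. */
static int go_to(int s, int nt)
{
    if (s == 0 && nt == RHYME) return 1;
    if (s == 0 && nt == SOUND) return 2;
    if (s == 2 && nt == PLACE) return 4;
    return -1;
}

/* Return 1 if text is accepted, 0 if the error action is reached. */
int parse_rhyme(const char *text)
{
    int stack[16], top = 0;
    int tok = -1;                  /* -1 means no look-ahead is held */
    int shift_to, rule_len, lhs;

    src = text;
    stack[0] = 0;
    for (;;) {
        int s = stack[top];

        /* States 4, 5, and 6 reduce without consulting the look-ahead. */
        if (s == 4 || s == 5 || s == 6) {
            rule_len = (s == 5) ? 1 : 2;
            lhs = (s == 4) ? RHYME : (s == 5) ? PLACE : SOUND;
            top -= rule_len;                   /* pop the right side */
            stack[top + 1] = go_to(stack[top], lhs);
            top++;
            continue;
        }
        if (tok == -1)
            tok = rhyme_token();
        if (s == 1)
            return tok == END;                 /* accept or error */
        shift_to = (s == 0 && tok == DING) ? 3 :
                   (s == 2 && tok == DELL) ? 5 :
                   (s == 3 && tok == DONG) ? 6 : -1;
        if (shift_to < 0)
            return 0;                          /* error action */
        stack[++top] = shift_to;
        tok = -1;                              /* look-ahead is cleared */
    }
}
```

Calling parse_rhyme with "DING DONG DELL" follows the steps traced
above; the incorrect strings mentioned in the preceding paragraph each
reach an error entry in state 1 or state 2.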



AMBIGUITY AND CONFLICTS 

A set of grammar rules is ambiguous if there is some input 
string that can be structured in two or more different ways. 
For example, the grammar rule 



expr : expr '-' expr

is a natural way of expressing the fact that one way of forming 
an arithmetic expression is to put two other expressions 
together with a minus sign between them. Unfortunately, this 
grammar rule does not completely specify the way that all
complex inputs should be structured. For example, if the input
is 

expr - expr - expr 
the rule allows this input to be structured as either 

( expr - expr ) - expr 
or as 

expr - ( expr - expr ) 

(The first is called "left association", the second "right 
association".) 
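
The two structures are not merely different in form; for subtraction
they give different results. The C sketch below makes this concrete.
The function names left_assoc and right_assoc are chosen for this
example only.

```c
/* Left association: ( a - b ) - c */
int left_assoc(int a, int b, int c)
{
    return (a - b) - c;
}

/* Right association: a - ( b - c ) */
int right_assoc(int a, int b, int c)
{
    return a - (b - c);
}
```

For the values 8, 3, and 2, left association yields 3 while right
association yields 7, so the parser's choice affects the meaning of
the input.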

The yacc program detects such ambiguities when it is 
attempting to build the parser. Given the input 

expr - expr - expr 

consider the problem that confronts the parser. When the 
parser has read the second expr, the input seen 

expr - expr 

matches the right side of the grammar rule above. The parser 
could reduce the input by applying this rule. After applying 
the rule, the input is reduced to "expr" (the left side of the 
rule). The parser would then read the final part of the input 

- expr

and again reduce. The effect of this is to take the left
associative interpretation. 

Alternatively, if the parser sees 

expr - expr 

it could defer the immediate application of the rule and 
continue reading the input until 

expr - expr - expr 

is seen. It could then apply the rule to the rightmost three 
symbols reducing them to "expr" which results in 

expr - expr 

being left. Now the rule can be reduced once more. The effect 
is to take the right associative interpretation. Thus, having 
read 

expr - expr 

the parser can do one of two legal things, a shift or a reduction. 
It has no way of deciding between them. This is called a 
"shift/reduce conflict". It may also happen that the parser has 
a choice of two legal reductions. This is called a "reduce/reduce 
conflict". Note that there are never any shift/shift conflicts. 

When there are shift/reduce or reduce/reduce conflicts, yacc 
still produces a parser. It does this by selecting one of the valid 
steps wherever it has a choice. A rule describing the choice to 
make in a given situation is called a "disambiguating rule".
The yacc program invokes two disambiguating rules by
default: 



1. In a shift/reduce conflict, the default is to do the shift. 

2. In a reduce/reduce conflict, the default is to reduce by 
the earlier grammar rule (in the input sequence). 

Rule 1 implies that reductions are deferred in favor of shifts 
when there is a choice. Rule 2 gives the user rather crude 
control over the behavior of the parser in this situation, but 
reduce/reduce conflicts should be avoided when possible. 

Conflicts may arise because of mistakes in input or logic or 
because the grammar rules (while consistent) require a more 
complex parser than yacc can construct. The use of actions 
within rules can also cause conflicts if the action must be done 
before the parser can be sure which rule is being recognized. In 
these cases, the application of disambiguating rules is 
inappropriate and leads to an incorrect parser. For this reason, 
yacc always reports the number of shift/reduce and 
reduce/reduce conflicts resolved by Rule 1 and Rule 2.

In general, whenever it is possible to apply disambiguating 
rules to produce a correct parser, it is also possible to rewrite 
the grammar rules so that the same inputs are read but there 
are no conflicts. For this reason, most previous parser 
generators have considered conflicts to be fatal errors. Our 
experience has suggested that this rewriting is somewhat 
unnatural and produces slower parsers. Thus, yacc will 
produce parsers even in the presence of conflicts. 

As an example of the power of disambiguating rules, consider

stat : IF '(' cond ')' stat
     | IF '(' cond ')' stat ELSE stat
     ;



which is a fragment from a programming language involving 
an "if-then-else" statement. In these rules, "IF" and "ELSE" 
are tokens, "cond" is a nonterminal symbol describing 
conditional (logical) expressions, and "stat" is a nonterminal 
symbol describing statements. The first rule will be called the
"simple-if" rule and the second the "if-else" rule.

These two rules form an ambiguous construction since input of 
the form 

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be structured according to these rules in two ways:

IF ( C1 )
{
        IF ( C2 )
                S1
}
ELSE
        S2

or

IF ( C1 )
{
        IF ( C2 )
                S1
        ELSE
                S2
}

where the second interpretation is the one given in most
programming languages having this construct. Each "ELSE" is 
associated with the last preceding "un-ELSE'd" IF. In this 
example, consider the situation where the parser has seen 

IF ( C1 ) IF ( C2 ) S1

and is looking at the "ELSE". It can immediately reduce by
the simple-if rule to get

IF ( C1 ) stat

and then read the remaining input

ELSE S2

and reduce

IF ( C1 ) stat ELSE S2

by the if-else rule. This leads to the first of the above 
groupings of the input. 

On the other hand, the "ELSE" may be shifted, "S2" read, and 
then the right-hand portion of 

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be reduced by the if-else rule to get

IF ( C1 ) stat

which can be reduced by the simple-if rule. This leads to the
second of the above groupings of the input which is usually
desired.

Once again, the parser can do two valid things; there is a
shift/reduce conflict. The application of disambiguating rule 1 
tells the parser to shift in this case, which leads to the desired 
grouping. 
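
The practical difference between the two groupings can be seen by
writing each one out with explicit braces in C. In the sketch below,
the function names outer_else and inner_else and the result codes
(1 for S1, 2 for S2, 0 when neither statement runs) are assumptions
made for illustration.

```c
/* First grouping: the ELSE is attached to the outer IF. */
int outer_else(int c1, int c2)
{
    int result = 0;

    if (c1) {
        if (c2)
            result = 1;   /* S1 */
    } else {
        result = 2;       /* S2 */
    }
    return result;
}

/* Second grouping: the ELSE is attached to the inner IF.
 * This is the grouping obtained by shifting the ELSE. */
int inner_else(int c1, int c2)
{
    int result = 0;

    if (c1) {
        if (c2)
            result = 1;   /* S1 */
        else
            result = 2;   /* S2 */
    }
    return result;
}
```

The two functions agree when both conditions are true but differ
otherwise: with C1 false, only the first grouping executes S2, and
with C1 true and C2 false, only the second does.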

This shift/reduce conflict arises only when there is a particular 
current input symbol, "ELSE", and particular inputs, such as 

IF ( C1 ) IF ( C2 ) S1

have already been seen. In general, there may be many 
conflicts, and each one will be associated with an input symbol 
and a set of previously read inputs. The previously read inputs 
are characterized by the state of the parser. 

The conflict messages of yacc are best understood by 
examining the verbose (-v) option output file. For example,
the output corresponding to the above conflict state might be 

23: shift/reduce conflict (shift 45, reduce 18) on ELSE

state 23

        stat : IF ( cond ) stat_ (18)
        stat : IF ( cond ) stat_ELSE stat

        ELSE shift 45
        . reduce 18

where the first line describes the conflict, giving the state and
the input symbol. The ordinary state description gives the 
grammar rules active in the state and the parser actions. 
Recall that the underline marks the portion of the grammar 
rules which has been seen. Thus in the example, in state 23 the 
parser has seen input corresponding to

IF ( cond ) stat

and the two grammar rules shown are active at this time. The 
parser can do two possible things. If the input symbol is 
"ELSE", it is possible to shift into state 45. State 45 will have, 
as part of its description, the line 

stat : IF ( cond ) stat ELSE_stat

since the "ELSE" will have been shifted in this state. In state 
23, the alternative action [described by a dot (.)] is to be done if
the input symbol is not mentioned explicitly in the actions. In 
this case, if the input symbol is not "ELSE", the parser reduces 
to 

stat : IF '(' cond ')' stat 

by grammar rule 18. 

Once again, notice that the numbers following "shift" 
commands refer to other states, while the numbers following 
"reduce" commands refer to grammar rule numbers. In the 
y.output file, the rule numbers are printed after those rules
that can be reduced. In most states, there is at most one reduce
action possible in the state, and this is the default command.
The user who encounters unexpected shift/reduce conflicts will
probably want to look at the verbose output to decide whether
the default actions are appropriate.

PRECEDENCE

There is one common situation where the rules given above for 
resolving conflicts are not sufficient. This is in the parsing of 
arithmetic expressions. Most of the commonly used 
constructions for arithmetic expressions can be naturally 
described by the notion of precedence levels for operators, 
together with information about left or right associativity. It 
turns out that ambiguous grammars with appropriate 
disambiguating rules can be used to create parsers that are 
faster and easier to write than parsers constructed from 
unambiguous grammars. The basic notion is to write grammar 
rules of the form 

expr : expr OP expr 

and 

expr : UNARY expr 

for all binary and unary operators desired. This creates a very 
ambiguous grammar with many parsing conflicts. As 
disambiguating rules, the user specifies the precedence or 
binding strength of all the operators and the associativity of 
the binary operators. This information is sufficient to allow 
yacc to resolve the parsing conflicts in accordance with these 
rules and construct a parser that realizes the desired 
precedences and associativities. 

The precedences and associativities are attached to tokens in 
the declarations section. This is done by a series of lines 
beginning with a yacc keyword: %left, %right, or
%nonassoc, followed by a list of tokens. All of the tokens on
the same line are assumed to have the same precedence level
and associativity; the lines are listed in order of increasing
precedence or binding strength. Thus:

%left '+' '-'
%left '*' '/'

describes the precedence and associativity of the four 
arithmetic operators. Plus and minus are left associative and 
have lower precedence than star and slash, which are also left 
associative. The keyword %right is used to describe right
associative operators, and the keyword %nonassoc is used to 
describe operators, like the operator .LT. in FORTRAN, that 
may not associate with themselves. Thus: 

A .LT. B .LT. C 

is illegal in FORTRAN and such an operator would be described 
with the keyword %nonassoc in yacc. As an example of the 
behavior of these declarations, the description 

%right '='
%left '+' '-'
%left '*' '/'

%%

expr : expr '=' expr
     | expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | NAME
     ;

might be used to structure the input 

a = b = c*d - e - f*g 

as follows

a = ( b = ( ((c*d)-e) - (f*g) ) )

in order to perform the correct precedence of operators. When 
this mechanism is used, unary operators must, in general, be 
given a precedence. Sometimes a unary operator and a binary 
operator have the same symbolic representation but different 
precedences. An example is unary and binary "-". Unary
minus may be given the same strength as multiplication, or 
even higher, while binary minus has a lower strength than 
multiplication. The keyword, %prec, changes the precedence 
level associated with a particular grammar rule. The keyword 
%prec appears immediately after the body of the grammar 
rule, before the action or closing semicolon, and is followed by a 
token name or literal. It causes the precedence of the grammar 
rule to become that of the following token name or literal. For 
example, the rules 

%left '+' '-'
%left '*' '/'

%%

expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '-' expr %prec '*'
     | NAME
     ;



might be used to give unary minus the same precedence as 
multiplication. 

A token declared by %left, %right, and %nonassoc need not
be, but may be, declared by %token as well.

The precedences and associativities are used by yacc to resolve
parsing conflicts. They give rise to disambiguating rules.
Formally, the rules work as follows:

1. The precedences and associativities are recorded for 
those tokens and literals that have them. 

2. A precedence and associativity is associated with each 
grammar rule. It is the precedence and associativity of 
the last token or literal in the body of the rule. If the 
%prec construction is used, it overrides this default. 
Some grammar rules may have no precedence and 
associativity associated with them. 

3. When there is a reduce/reduce conflict or there is a 
shift/reduce conflict and either the input symbol or the 
grammar rule has no precedence and associativity, then 
the two disambiguating rules given at the beginning of 
the section are used, and the conflicts are reported. 

4. If there is a shift/reduce conflict and both the grammar 
rule and the input character have precedence and 
associativity associated with them, then the conflict is 
resolved in favor of the action (shift or reduce) 
associated with the higher precedence. If the 
precedences are the same, then the associativity is used; 
left associative implies reduce, right associative implies 
shift, and nonassociating implies error. 
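Rules 3 and 4 for a shift/reduce conflict can be sketched in C as follows. This is an illustration only, not the code used inside yacc; the type names, the function name resolve, and the convention that a larger level number means a higher precedence are all assumptions made for the example.

```c
/* Associativity of a token or rule; A_NONE means none was declared. */
enum assoc { A_NONE, A_LEFT, A_RIGHT, A_NONASSOC };

/* Possible outcomes for a shift/reduce conflict. */
enum action { ACT_SHIFT, ACT_REDUCE, ACT_ERROR, ACT_SHIFT_REPORTED };

struct prec {
    int level;          /* larger number = higher precedence (assumed) */
    enum assoc assoc;
};

enum action resolve(struct prec rule, struct prec token)
{
    /* Rule 3: if either side has no precedence, use the default
       (shift) and report the conflict. */
    if (rule.assoc == A_NONE || token.assoc == A_NONE)
        return ACT_SHIFT_REPORTED;

    /* Rule 4: the action with the higher precedence wins. */
    if (rule.level > token.level)
        return ACT_REDUCE;
    if (token.level > rule.level)
        return ACT_SHIFT;

    /* Equal precedence: the associativity decides. */
    switch (token.assoc) {
    case A_LEFT:
        return ACT_REDUCE;
    case A_RIGHT:
        return ACT_SHIFT;
    default:
        return ACT_ERROR;       /* nonassociating */
    }
}
```

With '+' at level 2 and '*' at level 3, both left associative, a parser that has seen expr '+' expr and is looking at '*' shifts, while one that has seen expr '*' expr and is looking at '+' reduces.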

Conflicts resolved by precedence are not counted in the number 
of shift/reduce and reduce/reduce conflicts reported by yacc. 
This means that mistakes in the specification of precedences 
may disguise errors in the input grammar. It is a good idea to 
be sparing with precedences and use them in an essentially 
"cookbook" fashion until some experience has been gained. The 
y.output file is very useful in deciding whether the parser is
actually doing what was intended.

ERROR HANDLING

Error handling is an extremely difficult area, and many of the 
problems are semantic ones. When an error is found, for 
example, it may be necessary to reclaim parse tree storage, 
delete or alter symbol table entries, and, typically, set switches 
to avoid generating any further output. 

It is seldom acceptable to stop all processing when an error is 
found. It is more useful to continue scanning the input to find 
further syntax errors. This leads to the problem of getting the 
parser "restarted" after an error. A general class of 
algorithms to do this involves discarding a number of tokens 
from the input string and attempting to adjust the parser so 
that input can continue. 

To allow the user some control over this process, yacc provides 
a simple, but reasonably general feature. The token name 
"error" is reserved for error handling. This name can be used 
in grammar rules. In effect, it suggests places where errors are 
expected and recovery might take place. The parser pops its 
stack until it enters a state where the token "error" is legal. It 
then behaves as if the token "error" were the current look- 
ahead token and performs the action encountered. The look- 
ahead token is then reset to the token that caused the error. If 
no special error rules have been specified, the processing halts 
when an error is detected. 

In order to prevent a cascade of error messages, the parser, 
after detecting an error, remains in error state until three 
tokens have been successfully read and shifted. If an error is 
detected when the parser is already in error state, no message 
is given, and the input token is quietly deleted. 
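The three-token rule can be modeled with a small counter. The following is a simplified sketch of the behavior described above, not the code of the generated parser; the structure and function names are invented, and the sketch assumes that the count of three starts over whenever a further error is detected.

```c
/* Hypothetical model of the parser's error state. */
struct recovery {
    int countdown;   /* tokens still to be shifted before leaving error state */
    int messages;    /* error messages issued so far */
    int discarded;   /* tokens quietly deleted while in error state */
};

/* Called when a syntax error is detected. */
void on_error(struct recovery *r)
{
    if (r->countdown > 0)
        r->discarded++;      /* already in error state: no message */
    else
        r->messages++;       /* fresh error: report it */
    r->countdown = 3;        /* three good tokens are needed from here */
}

/* Called each time a token is successfully read and shifted. */
void on_shift(struct recovery *r)
{
    if (r->countdown > 0)
        r->countdown--;
}
```

An error that arrives after only one or two shifted tokens produces no second message in this model, which is how a cascade of messages is avoided.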

As an example, a rule of the form 

stat : error

means that on a syntax error the parser attempts to skip over
the statement in which the error is seen. More precisely, the 
parser scans ahead, looking for three tokens that might legally 
follow a statement, and starts processing at the first of these. 
If the beginnings of statements are not sufficiently distinctive, 
it may make a false start in the middle of a statement and end 
up reporting a second error where there is in fact no error. 

Actions may be used with these special error rules. These 
actions might attempt to reinitialize tables, reclaim symbol 
table space, etc. 

Error rules such as the above are very general but difficult to 
control. Rules such as 



stat : error ';' 

are somewhat easier. Here, when there is an error, the parser 
attempts to skip over the statement but does so by skipping to 
the next semicolon. All tokens after the error and before the 
next semicolon cannot be shifted and are discarded. When the 
semicolon is seen, this rule will be reduced and any "cleanup" 
action associated with it performed. 

Another form of error rule arises in interactive applications 
where it may be desirable to permit a line to be reentered after 
an error. The following example is one way to do this:

input : error '\n'
            {
            printf( "Reenter last line: " );
            }
        input
            {
            $$ = $4;
            }
      ;

There is one potential difficulty with this approach. The parser 
must correctly process three input tokens before it admits that 
it has correctly resynchronized after the error. If the reentered 
line contains an error in the first two tokens, the parser deletes 
the offending tokens and gives no message. This is clearly 
unacceptable. For this reason, there is a mechanism that can 
force the parser to believe that error recovery has been 
accomplished. The statement 

yyerrok ; 

in an action resets the parser to its normal mode. The last 
example can be rewritten as 

input : error '\n'
            {
            yyerrok;
            printf( "Reenter last line: " );
            }
        input
            {
            $$ = $4;
            }
      ;

which is somewhat better.

As previously mentioned, the token seen immediately after the
"error" symbol is the input token at which the error was 
discovered. Sometimes, this is inappropriate; for example, an 
error recovery action might take upon itself the job of finding 
the correct place to resume input. In this case, the previous 
look-ahead token must be cleared. The statement 

yyclearin ; 

in an action will have this effect. For example, suppose the 
action after error were to call some sophisticated 
resynchronization routine (supplied by the user) that attempted 
to advance the input to the beginning of the next valid 
statement. After this routine is called, the next token returned 
by yylex is presumably the first token in a legal statement. 
The old illegal token must be discarded and the error state 
reset. A rule similar to 



stat : error
            {
            resynch();
            yyerrok;
            yyclearin;
            }
     ;

could perform this. 

These mechanisms are admittedly crude but do allow for a 
simple, fairly effective recovery of the parser from many errors. 
Moreover, the user can get control to deal with the error 
actions required by other portions of the program.

THE "yacc" ENVIRONMENT

When the user inputs a specification to yacc, the output is a
file of C language programs, called y.tab.c on most systems.
(Due to local file system conventions, the names may differ
from installation to installation.) The function produced by
yacc is called yyparse(); it is an integer valued function. When
it is called, it in turn repeatedly calls yylex(), the lexical
analyzer supplied by the user (see "LEXICAL ANALYSIS"), to
obtain input tokens. Eventually, either an error is detected, in
which case (if no error recovery is possible) yyparse() returns
the value 1, or the lexical analyzer returns the end-marker
token and the parser accepts. In this case, yyparse() returns
the value 0.

The user must provide a certain amount of environment for this 
parser in order to obtain a working program. For example, as 
with every C language program, a program called main() must
be defined that eventually calls yyparse(). In addition, a
routine called yyerror() prints a message when a syntax error
is detected.

These two routines must be supplied in one form or another by
the user. To ease the initial effort of using yacc, a library has
been provided with default versions of main() and yyerror().
The name of this library is system dependent; on many
systems, the library is accessed by a -ly argument to the
loader. The source codes

main()
{
    return ( yyparse() );
}

and

# include <stdio.h>

yyerror(s)
char *s;
{
    fprintf( stderr, "%s\n", s );
}

show the triviality of these default programs. The argument to 
yyerror() is a string containing an error message, usually the
string "syntax error". The average application wants to do
better than this. Ordinarily, the program should keep track of
the input line number and print it along with the message
when a syntax error is detected. The external integer variable
yychar contains the look-ahead token number at the time the
error was detected. This may be of some interest in giving
better diagnostics. Since the main() program is probably
supplied by the user (to read arguments, etc.), the yacc library 
is useful only in small projects or in the earliest stages of 
larger ones. 
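A version of yyerror() that reports the line number and the look-ahead token might look like the following sketch. The variable lineno and the helper format_error() are hypothetical; the sketch assumes that the user's yylex() increments lineno at each newline. The definition of yychar here is only a stand-in so that the fragment compiles by itself; in a real parser the variable is defined in y.tab.c and would be declared extern instead.

```c
#include <stdio.h>

/* Hypothetical counter, assumed to be incremented by yylex() at each
   newline; it is not provided by yacc. */
int lineno = 1;

/* Stand-in definition so the sketch is self-contained; with a real
   parser, use "extern int yychar;" instead. */
int yychar = -1;

/* Build the diagnostic text in buf. */
int format_error(char *buf, size_t size, const char *s)
{
    return snprintf(buf, size, "line %d: %s (look-ahead token %d)",
                    lineno, s, yychar);
}

/* Replacement for the library version of yyerror(). */
int yyerror(const char *s)
{
    char buf[128];

    format_error(buf, sizeof buf, s);
    fprintf(stderr, "%s\n", buf);
    return 0;
}
```

Keeping the formatting in a separate function is a design choice for the example only; it makes the message easy to check without capturing the standard error output.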

The external integer variable yydebug is normally set to 0. If it 
is set to a nonzero value, the parser will output a verbose 
description of its actions including a discussion of the input 
symbols read and what the parser actions are. Depending on 
the operating environment, it may be possible to set this 
variable by using a debugging system. 



HINTS FOR PREPARING SPECIFICATIONS 

This part contains miscellaneous hints on preparing efficient, 
easy to change, and clear specifications. The individual 
subsections are more or less independent.

Input Style

It is difficult to provide rules with substantial actions and still 
have a readable specification file. The following are a few style 
hints. 



1. Use all uppercase letters for token names and all 
lowercase letters for nonterminal names. This rule 
comes under the heading of "knowing who to blame when 
things go wrong". 

2. Put grammar rules and actions on separate lines. This 
allows either to be changed without an automatic need to 
change the other. 

3. Put all rules with the same left-hand side together. Put 
the left-hand side in only once and let all following rules 
begin with a vertical bar. 

4. Put a semicolon only after the last rule with a given 
left-hand side and put the semicolon on a separate line. 
This allows new rules to be added easily. 

5. Indent rule bodies by two tab stops and action bodies by 
three tab stops. 

The example in Appendix 1 is written following this style, as 
are the examples in this section (where space permits). The 
user must make up his own mind about these stylistic 
questions. The central problem, however, is to make the rules 
visible through the morass of action code. 




Left Recursion 

The algorithm used by the yacc parser encourages so-called
"left recursive" grammar rules. Rules of the form 

name : name rest_of_rule ; 

match this algorithm. Rules such as

list : item
     | list ',' item
     ;

and

seq : item
    | seq item
    ;


frequently arise when writing specifications of sequences and 
lists. In each of these cases, the first rule will be reduced for 
the first item only; and the second rule will be reduced for the 
second and all succeeding items. 

With right recursive rules, such as 



seq : item
    | item seq
    ;


the parser is a bit bigger; and the items are seen and reduced 
from right to left. More seriously, an internal stack in the 
parser is in danger of overflowing if a very long sequence is 
read. Thus, the user should use left recursion wherever
reasonable.

It is worth considering whether a sequence with zero elements has
any meaning, and if so, consider writing the sequence
specification as

seq : /* empty */
    | seq item
    ;


using an empty rule. Once again, the first rule would always be 
reduced exactly once before the first item was read, and then 
the second rule would be reduced once for each item read. 
Permitting empty sequences often leads to increased generality. 
However, conflicts might arise if yacc is asked to decide which 
empty sequence it has seen when it hasn't seen enough to know! 
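The difference in stack use between the left recursive and right recursive forms of "seq" discussed in this subsection can be seen by counting the grammar symbols held on the parse stack. The following sketch simulates only that count for a sequence of n items (n at least 1); it is not part of yacc, and the function names are invented for the illustration.

```c
/* Maximum number of symbols on the parse stack while recognizing
   n items with the left recursive rules
        seq : item | seq item ;                                  */
int max_depth_left(int n)
{
    int depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                /* shift item */
        if (depth > max)
            max = depth;
        if (i > 0)
            depth--;            /* reduce "seq item" to seq */
        /* for the first item, "item" is reduced to seq and the
           depth stays at 1 */
    }
    return max;
}

/* The same for the right recursive rules
        seq : item | item seq ;
   No reduction is possible until the last item has been shifted. */
int max_depth_right(int n)
{
    int depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                /* shift item; nothing to reduce yet */
        if (depth > max)
            max = depth;
    }
    return max;                 /* the reductions then unwind the stack */
}
```

For a thousand items the left recursive form never holds more than two symbols, while the right recursive form holds all thousand before the first reduction.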



Lexical Tie-ins 

Some lexical decisions depend on context. For example, the 
lexical analyzer might want to delete blanks normally but not 
within quoted strings, or names might be entered into a symbol 
table in declarations but not in expressions. 

One way of handling this situation is to create a global flag 
that is examined by the lexical analyzer and set by actions. For 
example,

%{
    int dflag;
%}

... other declarations ...

%%

prog : decls stats
     ;

decls : /* empty */
            {
            dflag = 1;
            }
      | decls declaration
      ;

stats : /* empty */
            {
            dflag = 0;
            }
      | stats statement
      ;

... other rules ...

specifies a program that consists of zero or more declarations 
followed by zero or more statements. The flag "dflag" is now 0
when reading statements and 1 when reading declarations,
except for the first token in the first statement. This token 
must be seen by the parser before it can tell that the 
declaration section has ended and the statements have begun. 
In many cases, this single token exception does not affect the 
lexical scan. 
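On the lexical analyzer's side, the flag might be consulted as in the following sketch, in which names are entered into a symbol table only while declarations are being read. The table, its size, and the function lookup() are hypothetical; only the variable dflag comes from the specification above.

```c
#include <string.h>

#define MAXSYMS 64

int dflag;                           /* set by the parser's actions */

static const char *symtab[MAXSYMS];  /* hypothetical symbol table */
static int nsyms;

/* Return the index of name in the table, or -1 if it is absent.
   A new entry is made only while declarations are being read. */
int lookup(const char *name)
{
    int i;

    for (i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], name) == 0)
            return i;
    if (dflag && nsyms < MAXSYMS) {
        symtab[nsyms] = name;        /* caller must keep the string alive */
        return nsyms++;
    }
    return -1;                       /* unknown name outside declarations */
}
```

A lexical analyzer written this way would call lookup() for each name it reads and could report an undeclared name when -1 is returned.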

This kind of "back-door" approach can be elaborated to a 
noxious degree. Nevertheless, it represents a way of doing 
some things that are difficult if not impossible to do otherwise.

Reserved Words

Some programming languages permit you to use words like "if", 
which are normally reserved as label or variable names, 
provided that such use does not conflict with the legal use of 
these names in the programming language. This is extremely 
hard to do in the framework of yacc. It is difficult to pass 
information to the lexical analyzer telling it "this instance of if 
is a keyword and that instance is a variable". The user can 
make a stab at it using the mechanism described in the last 
subsection, but it is difficult. 

A number of ways of making this easier are under advisement. 
Until then, it is better that the keywords be reserved, i.e., 
forbidden for use as variable names. There are powerful 
stylistic reasons for preferring this. 



ADVANCED TOPICS 

This part discusses a number of advanced features of yacc. 

Simulating Error and Accept in Actions 

The parsing actions of error and accept can be simulated in an 
action by use of macros YYACCEPT and YYERROR. The 
YYACCEPT macro causes yyparse() to return the value 0;
YYERROR causes the parser to behave as if the current input
symbol had been a syntax error; yyerror() is called, and error
recovery takes place. These mechanisms can be used to
simulate parsers with multiple end-markers or context sensitive
syntax checking.

Accessing Values in Enclosing Rules

An action may refer to values returned by actions to the left of 
the current rule. The mechanism is simply the same as with 
ordinary actions, a dollar sign followed by a digit. 



sent : adj noun verb adj noun
            {
            look at the sentence ...
            }
     ;

adj : THE
            {
            $$ = THE;
            }
    | YOUNG
            {
            $$ = YOUNG;
            }
    ;

noun : DOG
            {
            $$ = DOG;
            }
     | CRONE
            {
            if( $0 == YOUNG )
                {
                printf( "what?\n" );
                }
            $$ = CRONE;
            }
     ;



In this case, the digit may be 0 or negative. In the action
following the word CRONE, a check is made that the preceding
token shifted was not YOUNG. Obviously, this is only possible
when a great deal is known about what might precede the
symbol "noun" in the input. There is also a distinctly
unstructured flavor about this. Nevertheless, at times this 
mechanism prevents a great deal of trouble especially when a 
few combinations are to be excluded from an otherwise regular 
structure. 



Support for Arbitrary Value Types 

By default, the values returned by actions and the lexical 
analyzer are integers. The yacc program can also support 
values of other types including structures. In addition, yacc 
keeps track of the types and inserts appropriate union member 
names so that the resulting parser is strictly type checked. The 
yacc value stack is declared to be a union of the various types 
of values desired. The user declares the union and associates 
union member names to each token and nonterminal symbol 
having a value. When the value is referenced through a $$ or 
$n construction, yacc will automatically insert the appropriate 
union name so that no unwanted conversions take place. In 
addition, type checking commands such as lint are far more 
silent. 



There are three mechanisms used to provide for this typing. 
First, there is a way of defining the union. This must be done 
by the user since other programs, notably the lexical analyzer, 
must know about the union member names. Second, there is a 
way of associating a union member name with tokens and 
nonterminals. Finally, there is a mechanism for describing the 
type of those few values where yacc cannot easily determine 
the type. 

To declare the union, the user includes

%union
{
    body of union ...
}

in the declaration section. This declares the yacc value stack 
and the external variables yylval and yyval to have type equal 
to this union. If yacc was invoked with the -d option, the
union declaration is copied onto the y.tab.h file. Alternatively, 
the union may be declared in a header file, and a typedef used 
to define the variable YYSTYPE to represent this union. Thus, 
the header file might have said 

typedef union 

{ 
body of union ... 

} 
YYSTYPE; 

instead. The header file must be included in the declarations 
section by use of %{ and %}.

Once YYSTYPE is defined, the union member names must be 
associated with the various terminal and nonterminal names. 
The construction

<name>

is used to indicate a union member name. If this follows one of
the keywords %token, %left, %right, and %nonassoc, the
union member name is associated with the tokens listed. Thus,
saying

%left <optype> '+' '-'

causes any reference to values returned by these two tokens to
be tagged with the union member name optype. Another 
keyword, %type, is used to associate union member names 
with nonterminals. Thus, one might say 

%type <nodetype> expr stat 

to associate the union member nodetype with the nonterminal 
symbols "expr" and "stat". 
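In C terms, the effect of these tags is that each reference to a value selects one member of the union. The following sketch shows roughly what an action combining two "expr" values and an operator token amounts to once the member names have been inserted. The member names optype and nodetype follow the text; the member types, the node structure, and the function build() are assumptions made for the illustration.

```c
/* Hypothetical parse tree node. */
struct node {
    int op;
    struct node *left, *right;
};

/* A value stack union with the two member names used in the text;
   the types chosen for the members are assumptions. */
typedef union {
    int          optype;     /* for tokens declared with <optype> */
    struct node *nodetype;   /* for nonterminals typed <nodetype> */
} YYSTYPE;

/* Roughly what an action for  expr : expr '+' expr  does after the
   member names are inserted: $1 and $3 are read through nodetype,
   $2 through optype, and $$ is stored through nodetype. */
struct node *build(YYSTYPE *yyval, YYSTYPE lhs, YYSTYPE op, YYSTYPE rhs,
                   struct node *storage)
{
    storage->op = op.optype;
    storage->left = lhs.nodetype;
    storage->right = rhs.nodetype;
    yyval->nodetype = storage;
    return yyval->nodetype;
}
```

Because every access names a member explicitly, no conversion between the integer and pointer members can slip in unnoticed.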

There remain a couple of cases where these mechanisms are 
insufficient. If there is an action within a rule, the value 
returned by this action has no a priori type. Similarly, 
reference to left context values (such as $0) leaves yacc with 
no easy way of knowing the type. In this case, a type can be 
imposed on the reference by inserting a union member name 
between < and > immediately after the first $. The example 

rule : aaa
            {
            $<intval>$ = 3;
            }
       bbb
            {
            fun( $<intval>2, $<other>0 );
            }
     ;



shows this usage. This syntax has little to recommend it, but 
the situation arises rarely. 

A sample specification is given in Appendix 3. The facilities in 
this subsection are not triggered until they are used. In 
particular, the use of %type will turn on these mechanisms. 
When they are used, there is a fairly strict level of checking. 
For example, use of $n or $$ to refer to something with no 
defined type is diagnosed. If these facilities are not triggered,
the yacc value stack is used to hold int's, as was true
historically.



APPENDIX 1 



A Simple Example 

This example gives the complete yacc application for a small
desk calculator; the calculator has 26 registers labeled "a"
through "z" and accepts arithmetic expressions made up of the
operators +, -, *, /, % (mod operator), & (bitwise and), |
(bitwise or), and assignments. If an expression at the top level
is an assignment, the value is not printed; otherwise, the value
of the expression is printed. As in C language, an integer that
begins with 0 (zero) is assumed to be octal; otherwise, it is
assumed to be decimal.

As an example of a yacc specification, the desk calculator does
a reasonable job of showing how precedence and ambiguities
are used and demonstrates simple recovery. The major
oversimplifications are that the lexical analyzer is much
simpler than for most applications, and the output is produced
immediately line by line. Note the way that decimal and octal 
integers are read in by grammar rules. This job is probably 
better done by the lexical analyzer. 
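The arithmetic performed by the two "number" rules of the specification can be written directly in C, which is approximately what a lexical analyzer taking over this job would do. The function name read_number is hypothetical; like the grammar rules, the sketch chooses the base from the first digit and does not check that the digits of an octal number are less than 8.

```c
#include <ctype.h>

/* Convert a digit string the way the "number" rules do: a leading 0
   selects base 8, any other first digit selects base 10.
   Returns -1 if the string does not begin with a digit. */
int read_number(const char *s)
{
    int base, value;

    if (!isdigit((unsigned char)*s))
        return -1;
    value = *s - '0';                       /* number : DIGIT */
    base = (value == 0) ? 8 : 10;
    for (s++; isdigit((unsigned char)*s); s++)
        value = base * value + (*s - '0');  /* number : number DIGIT */
    return value;
}
```

Given the string "017" the function yields 15, and given "17" it yields 17.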

%{

# include <stdio.h>
# include <ctype.h>

int regs[26];
int base;

%}

%start list

%token DIGIT LETTER

%left '|'
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS    /* supplies precedence for unary minus */

%%      /* beginning of rule section */

list : /* empty */
     | list stat '\n'
     | list error '\n'
            {
            yyerrok;
            }
     ;



stat : expr
            {
            printf( "%d\n", $1 );
            }
     | LETTER '=' expr
            {
            regs[$1] = $3;
            }
     ;

expr : '(' expr ')'
            {
            $$ = $2;
            }
     | expr '+' expr
            {
            $$ = $1 + $3;
            }
     | expr '-' expr
            {
            $$ = $1 - $3;
            }
     | expr '*' expr
            {
            $$ = $1 * $3;
            }
     | expr '/' expr
            {
            $$ = $1 / $3;
            }
     | expr '%' expr
            {
            $$ = $1 % $3;
            }
     | expr '&' expr
            {
            $$ = $1 & $3;
            }
     | expr '|' expr
            {
            $$ = $1 | $3;
            }
     | '-' expr %prec UMINUS
            {
            $$ = - $2;
            }
     | LETTER
            {
            $$ = regs[$1];
            }
     | number
     ;

number : DIGIT
            {
            $$ = $1; base = ($1==0) ? 8 : 10;
            }
       | number DIGIT
            {
            $$ = base * $1 + $2;
            }
       ;



%%      /* start of program */

yylex()         /* lexical analysis routine */
{
    /* returns LETTER for lowercase letter, yylval = 0 through 25 */
    /* returns DIGIT for digit, yylval = 0 through 9 */
    /* all other characters are returned immediately */

    int c;

    /* skip blanks */
    while( (c = getchar()) == ' ' )
        ;

    /* c is now nonblank */

    if( islower( c ) )
        {
        yylval = c - 'a';
        return( LETTER );
        }

    if( isdigit( c ) )
        {
        yylval = c - '0';
        return( DIGIT );
        }

    return( c );
}



APPENDIX 2



YACC Input Syntax 

This appendix has a description of the yacc input syntax as a
yacc specification. Context dependencies, etc. are not 
considered. Ironically, the yacc input specification language is 
most naturally specified as an LR(2) grammar; the sticky part 
comes when an identifier is seen in a rule immediately 
following an action. If this identifier is followed by a colon, it is 
the start of the next rule; otherwise, it is a continuation of the 
current rule which just happens to have an action embedded in 
it. As implemented, the lexical analyzer looks ahead after 
seeing an identifier and decides whether the next token 
(skipping blanks, newlines, and comments, etc.) is a colon. If so, 
it returns the token C_IDENTIFIER. Otherwise, it returns 
IDENTIFIER. Literals (quoted strings) are also returned as 
IDENTIFIERs but never as part of C_IDENTIFIERs.



/* grammar for the input to yacc */

/* basic entries */
%token IDENTIFIER       /* includes identifiers and literals */
%token C_IDENTIFIER     /* identifier (but not literal)
                           followed by a colon */
%token NUMBER           /* [0-9]+ */

/* reserved words: %type=>TYPE %left=>LEFT, etc. */

%token LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION

%token MARK     /* the %% mark */
%token LCURL    /* the %{ mark */
%token RCURL    /* the %} mark */

/* ASCII character literals stand for themselves */

%start spec

%%

spec : defs MARK rules tail
     ;



tail : MARK
            {
            In this action, eat up the rest of the file
            }
     | /* empty: the second MARK is optional */
     ;

defs : /* empty */
     | defs def
     ;

def : START IDENTIFIER
    | UNION
            {
            Copy union definition to output
            }
    | LCURL
            {
            Copy C code to output file
            }
      RCURL
    | ndefs rword tag nlist
    ;

rword : TOKEN
      | LEFT
      | RIGHT
      | NONASSOC
      | TYPE
      ;



tag : /* empty: union tag is optional */
    | '<' IDENTIFIER '>'
    ;

nlist : nmno
      | nlist nmno
      | nlist ',' nmno
      ;

nmno : IDENTIFIER           /* Note: literal illegal with %type */
     | IDENTIFIER NUMBER    /* Note: illegal with %type */
     ;

/* rule section */

rules : C_IDENTIFIER rbody prec
      | rules rule
      ;

rule : C_IDENTIFIER rbody prec
     | '|' rbody prec
     ;

rbody : /* empty */
      | rbody IDENTIFIER
      | rbody act
      ;

act : '{'
            {
            Copy action, translate $$, etc.
            }
      '}'
    ;

prec : /* empty */
     | PREC IDENTIFIER
     | PREC IDENTIFIER act
     | prec ';'
     ;






APPENDIX 3 



An Advanced Example 

This appendix gives an example of a grammar using some of 
the advanced features. The desk calculator example in 
Appendix 1 is modified to provide a desk calculator that does 
floating point interval arithmetic. The calculator understands 
floating point constants; the arithmetic operations +, -, *, /,
unary -, and the variables "a" through "z". Moreover, it also understands
intervals written 



(X,Y) 



where X is less than or equal to Y. There are 26 interval valued 
variables "A" through "Z" that may also be used. The usage
is similar to that in Appendix 1; assignments return no values 
and print nothing while expressions print the (floating or 
interval) value. 

This example explores a number of interesting features of yacc 
and C language. Intervals are represented by a structure 
consisting of the left and right endpoint values stored as 
doubles. This structure is given a type name, INTERVAL, by 
using typedef. The yacc value stack can also contain floating 
point scalars and integers (used to index into the arrays 
holding the variable values). Notice that the entire strategy 
depends strongly on being able to assign structures and unions 
in C language. In fact, many of the actions call functions that 
return structures as well. 
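The representation just described might be declared as in the following sketch. The type name INTERVAL comes from the text; the structure tag, the member names lo and hi, and the function iadd() are assumptions made for the illustration and need not match the names used in the actual specification.

```c
/* An interval: left and right endpoints stored as doubles. */
typedef struct interval {
    double lo, hi;
} INTERVAL;

/* Sum of two intervals.  The arguments are passed by value and the
   result is returned by value, which relies on structure assignment
   being available in the C compiler. */
INTERVAL iadd(INTERVAL a, INTERVAL b)
{
    INTERVAL r;

    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi;
    return r;
}
```

An action can then assign the result of such a function directly to a value stack entry of interval type.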

It is also worth noting the use of YYERROR to handle error
conditions: division by an interval containing 0 and an interval
presented in the wrong order. The error recovery mechanism of
yacc is used to throw away the rest of the offending line.

In addition to the mixing of types on the value stack, this
grammar also demonstrates an interesting use of syntax to 
keep track of the type (for example, scalar or interval) of 
intermediate expressions. Note that a scalar can be
automatically promoted to an interval if the context demands
an interval value. This causes a large number of conflicts when
the grammar is run through yacc: 18 shift/reduce and 26
reduce/reduce. The problem can be seen by looking at the two
input lines: 

2.5 + ( 3.5 - 4. ) 

and 

2.5 + ( 3.5 , 4. ) 

Notice that the 2.5 is to be used in an interval value expression 
in the second example, but this fact is not known until the 
comma is read. By this time, 2.5 is finished, and the parser 
cannot go back and change its mind. More generally, it might 
be necessary to look ahead an arbitrary number of tokens to 
decide whether to convert a scalar to an interval. This problem 
is evaded by having two rules for each binary interval valued 
operator— one when the left operand is a scalar and one when 
the left operand is an interval. In the second case, the right 
operand must be an interval, so the conversion will be applied 
automatically. Despite this evasion, there are still many cases 
where the conversion may be applied or not, leading to the 
above conflicts. They are resolved by listing the rules that yield 
scalars first in the specification file; in this way, the conflict 
will be resolved in the direction of keeping scalar valued 
expressions scalar valued until they are forced to become 
intervals. 

This way of handling multiple types is very instructive but not 
very general. If there were many kinds of expression types 
instead of just two, the number of rules needed would increase 
dramatically and the conflicts even more dramatically. Thus, 
while this example is instructive, it is better practice in a more 
normal programming language environment to keep the type 
information as part of the value and not as part of the 
grammar. 
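
As a rough illustration of keeping the type with the value, the sketch below tags each value as a scalar or an interval and promotes scalars on demand. The names VALUE, T_SCALAR, T_INTERVAL, as_interval, and value_add are hypothetical and are not part of the example that follows; the sketch also uses ANSI C prototypes.

```c
#include <assert.h>

typedef struct interval { double lo, hi; } INTERVAL;

enum vtype { T_SCALAR, T_INTERVAL };    /* hypothetical type tags */

typedef struct value {
    enum vtype type;
    union {
        double   d;     /* valid when type == T_SCALAR */
        INTERVAL v;     /* valid when type == T_INTERVAL */
    } u;
} VALUE;

/* Promote any value to an interval; a scalar x becomes (x, x). */
INTERVAL as_interval(VALUE x)
{
    INTERVAL r;

    assert(x.type == T_SCALAR || x.type == T_INTERVAL);
    if (x.type == T_INTERVAL)
        return x.u.v;
    r.lo = r.hi = x.u.d;
    return r;
}

/* One addition routine covers all four operand combinations. */
VALUE value_add(VALUE a, VALUE b)
{
    VALUE r;

    if (a.type == T_SCALAR && b.type == T_SCALAR) {
        r.type = T_SCALAR;              /* stay scalar while possible */
        r.u.d = a.u.d + b.u.d;
    } else {
        INTERVAL x = as_interval(a), y = as_interval(b);

        r.type = T_INTERVAL;
        r.u.v.lo = x.lo + y.lo;
        r.u.v.hi = x.hi + y.hi;
    }
    return r;
}
```

With this arrangement a grammar would need only one rule per operator, and the decision to promote is made at run time inside the action rather than by the parser.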

Finally, a word about the lexical analysis. The only unusual 
feature is the treatment of floating point constants. The C 
language library routine atof() is used to do the actual 
conversion from a character string to a double precision value. 
If the lexical analyzer detects an error, it responds by returning 
a token that is illegal in the grammar, provoking a syntax error 
in the parser and thence error recovery. 

%{ 

#include <stdio.h> 
#include <ctype.h> 

typedef struct interval 
{ 
    double lo, hi; 
} INTERVAL; 

INTERVAL vmul( ), vdiv( ); 

double atof( ); 

double dreg[ 26 ]; 
INTERVAL vreg[ 26 ]; 

%} 

%start lines 

%union 
{ 
    int ival; 
    double dval; 
    INTERVAL vval; 
} 

%token <ival> DREG VREG /* indices into dreg, vreg arrays */ 

%token <dval> CONST /* floating point constant */ 

%type <dval> dexp /* expression */ 

%type <vval> vexp /* interval expression */ 

/* precedence information about the operators */ 

%left '+' '-' 
%left '*' '/' 
%left UMINUS /* precedence for unary minus */ 

%% 

lines : /* empty */ 
      | lines line 
      ; 

line  : dexp '\n' 
          { printf( "%15.8f\n", $1 ); } 
      | vexp '\n' 
          { printf( "(%15.8f , %15.8f )\n", $1.lo, $1.hi ); } 
      | DREG '=' dexp '\n' 
          { dreg[$1] = $3; } 
      | VREG '=' vexp '\n' 
          { vreg[$1] = $3; } 
      | error '\n' 
          { yyerrok; } 
      ; 

dexp  : CONST 
      | DREG 
          { $$ = dreg[$1]; } 
      | dexp '+' dexp 
          { $$ = $1 + $3; } 
      | dexp '-' dexp 
          { $$ = $1 - $3; } 
      | dexp '*' dexp 
          { $$ = $1 * $3; } 
      | dexp '/' dexp 
          { $$ = $1 / $3; } 
      | '-' dexp %prec UMINUS 
          { $$ = - $2; } 
      | '(' dexp ')' 
          { $$ = $2; } 
      ; 

vexp  : dexp 
          { $$.hi = $$.lo = $1; } 
      | '(' dexp ',' dexp ')' 
          { 
              $$.lo = $2; 
              $$.hi = $4; 
              if( $$.lo > $$.hi ) 
              { 
                  printf( "interval out of order\n" ); 
                  YYERROR; 
              } 
          } 
      | VREG 
          { $$ = vreg[$1]; } 
      | vexp '+' vexp 
          { 
              $$.hi = $1.hi + $3.hi; 
              $$.lo = $1.lo + $3.lo; 
          } 
      | dexp '+' vexp 
          { 
              $$.hi = $1 + $3.hi; 
              $$.lo = $1 + $3.lo; 
          } 
      | vexp '-' vexp 
          { 
              $$.hi = $1.hi - $3.lo; 
              $$.lo = $1.lo - $3.hi; 
          } 
      | dexp '-' vexp 
          { 
              $$.hi = $1 - $3.lo; 
              $$.lo = $1 - $3.hi; 
          } 
      | vexp '*' vexp 
          { $$ = vmul( $1.lo, $1.hi, $3 ); } 
      | dexp '*' vexp 
          { $$ = vmul( $1, $1, $3 ); } 
      | vexp '/' vexp 
          { 
              if( dcheck( $3 ) ) YYERROR; 
              $$ = vdiv( $1.lo, $1.hi, $3 ); 
          } 
      | dexp '/' vexp 
          { 
              if( dcheck( $3 ) ) YYERROR; 
              $$ = vdiv( $1, $1, $3 ); 
          } 
      | '-' vexp %prec UMINUS 
          { $$.hi = -$2.lo; $$.lo = -$2.hi; } 
      | '(' vexp ')' 
          { $$ = $2; } 
      ; 



%% 

#define BSZ 50 /* buffer size for floating point number */ 

/* lexical analysis */ 

yylex( ) 
{ 
    register c; 

    /* skip over blanks */ 
    while( ( c = getchar( ) ) == ' ' ) 
        ; 

    if( isupper( c ) ) 
    { 
        yylval.ival = c - 'A'; 
        return( VREG ); 
    } 

    if( islower( c ) ) 
    { 
        yylval.ival = c - 'a'; 
        return( DREG ); 
    } 

    /* gobble up digits, points, exponents */ 
    if( isdigit( c ) || c == '.' ) 
    { 
        char buf[BSZ+1], *cp = buf; 
        int dot = 0, exp = 0; 

        for( ; (cp-buf)<BSZ ; ++cp, c=getchar( ) ) 
        { 
            *cp = c; 
            if( isdigit( c ) ) 
                continue; 
            if( c == '.' ) 
            { 
                if( dot++ || exp ) 
                    return( '.' ); /* will cause syntax error */ 
                continue; 
            } 
            if( c == 'e' ) 
            { 
                if( exp++ ) 
                    return( 'e' ); /* will cause syntax error */ 
                continue; 
            } 
            /* end of number */ 
            break; 
        } 

        *cp = '\0'; 
        if( (cp-buf) >= BSZ ) 
            printf( "constant too long truncated\n" ); 
        else 
            ungetc( c, stdin ); /* push back last char read */ 
        yylval.dval = atof( buf ); 
        return( CONST ); 
    } 

    return( c ); 
} 

INTERVAL 
hilo( a, b, c, d ) 
double a, b, c, d; 
{ 
    /* returns the smallest interval containing a, b, c, and d */ 
    /* used by *, / routines */ 
    INTERVAL v; 

    if( a>b ) 
    { 
        v.hi = a; 
        v.lo = b; 
    } 
    else 
    { 
        v.hi = b; 
        v.lo = a; 
    } 

    if( c>d ) 
    { 
        if( c>v.hi ) 
            v.hi = c; 
        if( d<v.lo ) 
            v.lo = d; 
    } 
    else 
    { 
        if( d>v.hi ) 
            v.hi = d; 
        if( c<v.lo ) 
            v.lo = c; 
    } 

    return( v ); 
} 

INTERVAL vmul( a, b, v ) 
double a, b; 
INTERVAL v; 
{ 
    return( hilo( a*v.hi, a*v.lo, b*v.hi, b*v.lo ) ); 
} 

dcheck( v ) 
INTERVAL v; 
{ 
    if( v.hi >= 0. && v.lo <= 0. ) 
    { 
        printf( "divisor interval contains 0.\n" ); 
        return( 1 ); 
    } 

    return( 0 ); 
} 

INTERVAL vdiv( a, b, v ) 
double a, b; 
INTERVAL v; 
{ 
    return( hilo( a/v.hi, a/v.lo, b/v.hi, b/v.lo ) ); 
} 



Old Features Supported But Not Encouraged 

This appendix mentions synonyms and features that are 
supported for historical continuity but, for various reasons, are 
not encouraged. 



1. Literals may also be delimited by double quotes. 

2. Literals may be more than one character long. If all the 
characters are alphabetic, numeric, or _, the type 
number of the literal is defined just as if the literal did 
not have the quotes around it. Otherwise, it is difficult 
to find the value for such literals. 

The use of multicharacter literals is likely to mislead 
those unfamiliar with yacc since it suggests that yacc 
is doing a job which must actually be done by the lexical 
analyzer. 

3. Most places where % is legal, backslash "\" may be 
used. In particular, \\ is the same as %%, \left the same 
as %left, etc. 

4. There are a number of other synonyms: 

%< is the same as %left 

%> is the same as %right 

%binary and %2 are the same as %nonassoc 

%0 and %term are the same as %token 

%= is the same as %prec 

5. Actions may also have the form 

= { ... } 

and the curly braces can be dropped if the action is a 
single C language statement. 

6. The C language code between %{ and %} used to be 
permitted at the head of the rules section as well as in 
the declaration section. 



Chapter 23 

UNIX SYSTEM TO UNIX SYSTEM 
COPY: "UUCP" 



INTRODUCTION 

The UUCP network has provided a means of information 
exchange between UNIX systems over the Direct Distance Dialing 
(DDD) network for several years. This chapter provides you with the 
background to make use of the network. 

The first half of the chapter discusses concepts. 
Understanding these basic principles helps the user make the 
best possible use of the uucp network. The second half 
explains the use of the user level interface to the network and 
provides numerous examples. 

There are several major uses of the network. Some of the uses 
are: 



• Distribution of software 

• Distribution of documentation 

• Personal communication (mail) 

• Data transfer between closely sited machines 

• Transmission of debugging dumps and data exposing bugs 

• Production of hard copy output on remote printers. 



THE UUCP NETWORK 

The uucp(1) network is a collection of UNIX systems that allows 
file transfer and remote execution between its members. 
The extent of the network is a function of both 
the interconnection hardware and the controlling network 
software. Membership in the network is tightly controlled via 
the software to preserve the integrity of all members of the 
network. You cannot use the uucp facility to send files to 
systems that are not part of the uucp network. The following 
parts describe the topology, services, operating rules, etc., of the 
network to provide a framework for discussing use of the 
network. 



Network Hardware 

The uucp network was originally designed as a dialup network so that 
systems in the network could use the DDD network to 
communicate with each other. The three most common 
methods of connecting systems are: 

1. Connecting two UNIX systems directly by cross-coupling 
(via a null modem) two of the computers' ports. This means 
of connection is useful for only short distances (several 
hundred feet can be achieved although the RS-232 standard 
specifies a much shorter distance) and is usually run at 
high speed (9600 baud). These connections run on 
asynchronous terminal ports. 

2. Using a modem (a private line or a limited distance 
modem) to directly connect processors over a private line 
(using 103- or 212-type data sets). 

3. Connecting a processor to another system through a 
modem, an automatic calling unit (ACU) or an internal 
modem on the UNIX PC, and the DDD network. This is 
by far the most common interconnection method, and it 
makes available the largest number of connections. 



Network Topology 

A large number of connections between systems are possible 
via the DDD network. The topology of the network is 
determined by both the hardware connections and the software 
that controls the network. The next two parts deal with how 
that topology is controlled. 



Hardware Topology 

As discussed earlier, it is possible to build a network using 
permanent or dial up connections. In Figure 23-1, a group of 
systems (A, B, C, D, and E) are shown connected via hard-wired 
lines. All systems are assumed to have some answer-only data 
sets so that remote users or systems can be connected. A few 
systems have automatic calling units (K, D, F, and G) and one 
system (H) has no capability for calling other systems. Users 
should be aware that the network consists of a series of 
point-to-point connections (A-B, B-C, D-B, E-B) even though it 
appears in Figure 23-1 that A and C are directly connected 
through B. The following observations are made: 

1. System H is isolated. It can be made part of the network 
by arranging for other systems to poll it at fixed intervals. 
This is an important concept to remember since transfers 
from systems that are polled do not leave the system until 
that system is called by a polling system. 

2. Systems K, F, G, and D easily reach all other systems since 
they have calling units. 

3. If system A (E or C) wishes to send a file to H (K, F, or 
G), it must first send it to D (via system B) since D is the 
only system with a calling unit. 

Figure 23-1. UUCP Nodes 

Software Topology 

The hardware capability of systems in the network defines the 
maximum number of connections in the network. The software 
at each node restricts the access by other systems and thereby 
defines the extent of the network. The systems of Figure 23-1 
can be configured so that they appear as a network of systems 
that have equal access to each other or some restrictions can be 
applied. As part of the security mechanism used by uucp, the 
extent of access that other systems have can be controlled at 
each node. Figures 23-2 and 23-3 show how the network might 
appear at one node. 

Figure 23-2. UUCP Network Excluding One Node 

Figure 23-3. UUCP Network With Several Levels of 
Permissions 

Access is available from all systems in Figure 23-2; in 
Figure 23-3, however, some of the systems have been configured to have 
greater or lesser access privileges than others (i.e., systems C, E, 
and G have one set of access privileges, systems F and B have 
another set, etc.). 

UUCP uses the UNIX system password mechanism coupled 
with a system file (/usr/lib/uucp/L.sys) and a file system 
permission file (/usr/lib/uucp/USERFILE) to control access 
between systems. The password file entries for uucp (usually, 
luucp, nuucp, uucp, etc.) allow only those remote systems 
that know the passwords for these IDs to access the local 
system. (Great care should be taken in revealing the password 
for these uucp logins since knowing the password allows a 
system to join the network.) The system file 
(/usr/lib/uucp/L.sys) defines the remote systems that a local 
host knows about. This file contains all information needed for 
a local host to contact a remote system (including system name, 
password, login sequence, etc.) and as such is protected from 
viewing by ordinary users. 

In summary, while the available hardware on a network of 
systems determines the connectivity of the systems, the 
combination of password file entries and the uucp system files 
determines the extent of the network. 



Forwarding 

One of the recent additions to uucp (for UNIX system 5.0) is a 
limited forwarding capability whereby systems that are part of 
the network can forward files through intermediate nodes. For 
example, in Figure 23-1, it is possible to send a file between 
nodes A and C through intermediate node B. For security 
reasons, when forwarding, files may only be transmitted to the 
public area or fetched from the remote system's public area. 



Security 

The most critical feature of any network is the security that it 
provides. Users are familiar with the security that UNIX 
systems provide in protecting files from access by other users 
and in accessing the system via passwords. In building a 
network of processors, the notion of security is widened because 
access by a wider community of users is granted. Access is 
granted on a system basis (that is, access is granted to all users 
on a remote system). This follows from the fact that the 
process of sending (receiving) a file to (from) another system is 
done via daemons that use one or more special user IDs. These user 
IDs are granted (denied) access to the system via the uucp 
system file (/usr/lib/uucp/L.sys) and the areas that the system 
has access to are controlled by another file 
(/usr/lib/uucp/USERFILE). For example, access can be 
granted to the entire file system tree or limited to specific 
areas. 



Software Structure 

The uucp network is a batch network. That is, when a request 
is made, it is spooled for later transmission by a daemon. This 
is important to users because the success or failure of a 
command is only known at some later time via mail(1) 
notification. For most transfers, there is little trouble in 
transmitting files between systems; however, transmissions are 
occasionally delayed or fail because a remote system cannot be 
reached. 



Rules of the Road 

There are several rules by which the network runs. These rules 
are necessary to provide the smooth flow of data between 
systems and to prevent duplicate transmissions and lost jobs. 
The following sections outline these rules and their influence on 
the network. 



Queuing 

Jobs submitted to the network are assigned a sequence number 
for transmission. Jobs are represented by a file (or files) in a 
common spool directory (/usr/spool/uucp). When a file 
transfer daemon (uucico) is started to transmit a job, it 
selects a system to contact and then transmits all jobs to that 
system. Before breaking off the conversation, any jobs to be 
received from that remote system are accepted. The system 
selected as the one to contact is randomly selected if there is 
work for more than one system. In releases of uucp prior to 
UNIX system 5.0, the first system appearing in the spool 
directory is selected so preference is given to the most recently 
spawned jobs. Uucp may be sending to or receiving from 
many systems simultaneously. The number of incoming 
requests is only limited by the number of connections on the 
system, and the number of outgoing transfers is limited by the 
number of ACUs (or direct connections). 



Dialing and the DDD Network 

In order to transfer data between processors that are not 
directly connected, an auto dialer is used to contact the remote 
system. There are several factors that can make contacting a 
remote system difficult. 

1. All lines to the remote system may be busy. There is a 
mechanism within uucp that restricts contact with a 
remote system to certain times of the day (week) to 
minimize this problem. 

2. The remote system may be down. 

3. There may be difficulty in dialing the number (especially if 
a large sequence of numbers involving access through PBXs 
is involved). The dialing algorithm tries dialing a number 
twice and the algorithm used to dial remote systems is not 
perfect, particularly when intermediate dial tones are 
involved. 



Scheduling and Polling 

When a job is submitted to the network, an attempt to contact 
the destination system is made immediately. Only one conversation at a 
time can exist between the same two systems. 

Systems that are polled can do nothing to force immediate 
transmission of data. Jobs will only be transmitted when the 
system is polled (hourly, daily, etc.) by a remote system. 



Retransmissions and Hysteresis 

The UUCP network is fairly persistent in its attempt to contact 
remote systems to complete a transmission. To prevent uucp 
from continually calling systems that are unavailable, 
hysteresis is built into the algorithm used to contact other 
systems. This mechanism forces a minimum fixed delay 
(specifiable on a per system basis) to occur before another 
transmission can take place to that system. 
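
The delay rule can be pictured with the short sketch below. It is only an illustration of the idea, not the code used by the uucp daemons; the function name may_retry and its parameters are assumptions, with min_delay standing for the per-system delay in seconds.

```c
#include <assert.h>
#include <time.h>

/*
 * Return 1 if a new call to a system may be placed now, 0 if the
 * minimum delay since the last failed attempt has not yet elapsed.
 * min_delay is in seconds and would be set on a per system basis.
 */
int may_retry(time_t last_failure, time_t now, long min_delay)
{
    assert(min_delay >= 0);
    return difftime(now, last_failure) >= (double)min_delay;
}
```

With a one-hour delay, a call that failed five minutes ago is not retried, while one that failed an hour ago is.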



Purging and Cleanup 

Transfers that cannot be completed after a defined period of 
time (72 hours is the value that is set when the system is 
distributed) are deleted and the user is notified. 



Special Places: The Public Area 

In order to allow the transfer of files to a system for which a 
user does not have a login, the public directory (usually kept in 
/usr/spool/uucppublic) is available with general access 
privileges. When receiving files in the public area, the user 
should dispose of them quickly as the administrative portion of 
uucp purges this area on a regular basis. 



Permissions 



File Level Protection 



In transferring files between systems, users should make sure 
that the destination area is writable by uucp. The uucp 
daemons preserve execute permission between systems and 
assign permission 0666 to transferred files. 
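
One possible reading of this rule is that a received file is readable and writable by everyone and keeps whatever execute bits the source file had. The sketch below expresses that reading; the function name received_mode is hypothetical, and the way the two parts of the rule are combined is an assumption rather than a description of the daemon's code.

```c
#include <assert.h>

/* Read/write for everyone, plus the execute bits of the source file. */
unsigned received_mode(unsigned source_mode)
{
    assert(source_mode <= 07777);
    return 0666u | (source_mode & 0111u);
}
```

Under this reading, a source file with mode 0644 arrives as 0666 and one with mode 0755 arrives as 0777.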



System Level Protection 

The system administrator at each site determines the global 
access permissions for that processor. Thus, access between 
systems may be confined by the administrator to only some 
sections of the file system. 



Forwarding Permissions 

The forwarding feature is a new addition to the uucp package. 
You should be aware that 



1. When forwarding is attempted through a node that is 
running an old version of uucp, the transmission fails. 

2. Nodes that allow forwarding can restrict the forwarding 
feature in several ways. 

a. Forwarding is allowed for only certain users. 

b. Forwarding to certain destination nodes (e.g., 
Australia) should be avoided. 

c. Forwarding for selected source nodes is allowed. 

3. The most important restriction is that forwarding is 
allowed only for files sent to or fetched from the public 
area. 



NETWORK USAGE 

The following parts discuss the user interface to the network 
and give examples of command usage. 



Name Space 

In order to reference files on remote systems, a syntax is 
necessary to uniquely identify a file. The notation must also 
have several defaults to allow the reference to be compact. 
Some restrictions must also be placed on pathnames to prevent 
security violations. For example, pathnames may not include 
".." as a component because it is difficult to determine 
whether the reference is to a restricted area. 
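
A check of this kind might look like the following sketch, which rejects a pathname only when one of its components is exactly "..". The function name has_dotdot is hypothetical; the sketch shows the idea and is not taken from the uucp source.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if any '/'-separated component of path is exactly "..". */
int has_dotdot(const char *path)
{
    const char *p = path;

    assert(path != NULL);
    while (*p != '\0') {
        size_t len = strcspn(p, "/");   /* length of this component */

        if (len == 2 && p[0] == '.' && p[1] == '.')
            return 1;
        p += len;
        if (*p == '/')
            p++;                        /* step over the separator */
    }
    return 0;
}
```

Comparing whole components means that a name such as file..c is still accepted, while /usr/you/../adm/file is refused.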



Naming Conventions 

Uucp uses a special syntax to build references to files on 
remote systems. The basic syntax is 



system-name!pathname 

where the system-name is a system that uucp is aware of. The 
pathname part of the name may contain any of the following: 

1. A fully qualified pathname such as 

mhtsa!/usr/you/file 
The pathname may also be a directory name as in 
mhtsa!/usr/you/directory 

2. The login directory on a remote may be specified by use of 
the ~ character. The combination ~user references the login 
directory of a user on the remote system. For example, 

mhtsa!~adm/file 

would expand to 

mhtsa!/usr/sys/adm/file 

if the login directory for user adm on the remote system is 
/usr/sys/adm. 

3. The public area is referenced by a similar use of the prefix 
~/user preceding the pathname. For example, 

mhtsa!~/you/file 

would expand to 

mhtsa!/usr/spool/uucp/you/file 

if /usr/spool/uucp is used as the spool directory. 

4. Pathnames not using any of the combinations or prefixes 
discussed above are prefixed with the current directory (or 
the login directory on the remote). For example, 

mhtsa!file 

would expand to 

mhtsa!/usr/you/file 

The naming convention can be used in reference to either the 
source or destination file names. 
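
The four expansion rules can be summarized in a short sketch. The function expand_path and the helper home_of are hypothetical names introduced for illustration; home_of stands in for a password-file lookup and knows only the adm example used above. The sketch uses snprintf, which assumes a C99 or later library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a password-file lookup on the remote system. */
static const char *home_of(const char *user, size_t len)
{
    if (len == 3 && strncmp(user, "adm", 3) == 0)
        return "/usr/sys/adm";
    return NULL;
}

/*
 * Expand the pathname part of system-name!pathname into out.
 * pubdir is the public area and cwd the default directory.
 * Returns 0 on success, -1 for an unknown user or a short buffer.
 */
int expand_path(const char *path, const char *pubdir, const char *cwd,
                char *out, size_t outsz)
{
    int n;

    assert(path != NULL && out != NULL);
    if (path[0] == '/') {                           /* rule 1 */
        n = snprintf(out, outsz, "%s", path);
    } else if (path[0] == '~' && path[1] == '/') {  /* rule 3 */
        n = snprintf(out, outsz, "%s%s", pubdir, path + 1);
    } else if (path[0] == '~') {                    /* rule 2 */
        size_t ulen = strcspn(path + 1, "/");
        const char *home = home_of(path + 1, ulen);

        if (home == NULL)
            return -1;
        n = snprintf(out, outsz, "%s%s", home, path + 1 + ulen);
    } else {                                        /* rule 4 */
        n = snprintf(out, outsz, "%s/%s", cwd, path);
    }
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Called with a public area of /usr/spool/uucp and a default directory of /usr/you, the function reproduces the expansions shown in the list above.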



Forwarding Syntax 

The newest feature of uucp is the ability to allow files to be 
passed between systems via intermediate nodes. This is done 
via a variation of the bang (!) syntax that describes the path to 
be taken to reach that file. For example, a user on system a 
wishing to transmit a file to system e might specify the 
transfer as 



uucp file b!c!d!e!~/you/file 

if the user desires the request to be sent through b, c, and d 
before reaching e. Note that the pathname is the path that the 
file would take to reach node e. Note also that the destination 
must be specified as the public area. Fetching a file from 
another system via intermediate nodes is done similarly. For 
example, 

uucp b!c!d!e!~/you/file x 

fetches file from system e and renames it x on the local system. 
The forwarding prefix is the path from the local system and 
not the path from the remote to the local system. The 
forwarding feature may also be used in conjunction with 
remote execution. For example, 

uux mhtsa!uucp mhtsb!mhrtc!/usr/spool/uucppublic/file x 

sends a request to mhtsa to execute the uucp command to copy 
a file from mhrtc to x on mhtsa. 
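
To make the structure of a forwarding name concrete, the sketch below splits a name such as b!c!d!e!~/you/file into its list of systems and its final pathname. The function split_route is a hypothetical helper for illustration and is not part of uucp.

```c
#include <assert.h>
#include <string.h>

/*
 * Split "b!c!d!e!~/you/file" in place: each '!' becomes '\0',
 * hops[] receives pointers to the system names in order, and the
 * return value points at the pathname. *nhops is set to the number
 * of systems. Returns NULL if there are more than maxhops systems.
 */
char *split_route(char *spec, char *hops[], int maxhops, int *nhops)
{
    char *bang;

    assert(spec != NULL && hops != NULL && nhops != NULL);
    *nhops = 0;
    while ((bang = strchr(spec, '!')) != NULL) {
        if (*nhops >= maxhops)
            return NULL;
        *bang = '\0';
        hops[(*nhops)++] = spec;
        spec = bang + 1;
    }
    return spec;        /* whatever follows the last '!' */
}
```

The systems come out in the order they are written, that is, in the order the request travels outward from the local system.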



Types of Transfers 



Uucp has a very flexible command syntax for file transmission. 
The following sections give examples of different combinations 
of transfers. 



Transmissions of Files to a Remote 

Any number of files can be transferred to a remote system via 
uucp. The syntax supports the *, ?, and [...] metacharacters. 
For example, 



uucp *.[ch] mhtsa!dir 

transfers all files whose names end in .c or .h to the directory 
dir in the user's login directory on mhtsa. 

Fetching Files From a Remote 

Files can be fetched from a remote system in a similar manner. 
For example, 

uucp mhtsa!*.[ch] dir 

will fetch all files ending in .c or .h from the user's login 
directory on mhtsa and place the copies in the subdirectory dir 
on the local system. 

Switching 

Transmission of files can be arranged in such a way that the 
local system effectively acts as a switch. For example, 

uucp mhtsb!files mhtsa!filed 

will fetch the file files from the user's login directory on mhtsb, rename 
it as filed, and place it in the login directory on mhtsa. 



Broadcasting 

Broadcast capability (that is, copying a file to many systems) is 
not supported by uucp; however, it can be simulated via a shell 
script as in 

for i in mhtsa mhtsb mhtsd 
do 
    uucp file $i!broad 
done 

Unfortunately, one uucp command is spawned for each 
transmission so that it is not possible to track the transfer as a 
single unit. 



Remote Executions 

The remote execution facility allows commands to be executed 
remotely. For example, 

uux "!diff mhtsa!/etc/passwd mhtsd!/etc/passwd > !pass.diff" 

will execute the command diff(1) on the password files on mhtsa 
and mhtsd and place the result in pass.diff. 



Spooling 

To continue modifying a file while a copy is being transmitted 
across the network, the -c option should be used. This forces a 
copy of the file to be queued. The default for uucp is not to 
queue copies of the files since it is wasteful of both CPU time 
and storage. For example, the following command forces the 
file work to be copied into the spool directory before it is 
transmitted: 

uucp -c work mhtsa!~/you/work 



Notification 

The success or failure of a transmission is reported to users 
asynchronously via the mail(1) command. A new feature of 
uucp is to provide notification to the user in a file (of the 
user's choice). The choices for notification are: 

1. Notification returned to the requester's system (via the -m 
option). This is useful when the requesting user is 
distributing files to other machines. Instead of logging 
onto the remote machine to read mail, mail is sent to the 
requester when the copy is finished. 

2. A variation of the -m option is to force notification in a 
file (using the -mfile option where file is a file name). For 
example, 

uucp -mans /etc/passwd mhtsb!/dev/null 

sends the file /etc/passwd to system mhtsb and places the 
file in the bit bucket (/dev/null). The status of the 
transfer is reported in the file ans as: 



UUCP job 0306 (8/20-23:08:09) (0:31:23) /etc/passwd copy succeeded 



Uux(1) always reports the exit status of the remote 
execution unless notification is suppressed (via the -n 
option). Notification can be sent to a different user on the 
remote system via the -nuser option. 



Tracking and Status 

The most pervasive change to the uucp package is revising the 
internal formatting of jobs so that each invocation of uucp or 
uux(1) corresponds to a single job. It is now possible to 
associate a single job number with each command execution so 
that the job can be terminated or its status obtained. 



The Job ID 

The default for the uucp and uux command is not to print the 
job number for each job. This was done for compatibility with 
previous versions of uucp and to prevent the many shell scripts 
built around uucp from printing job numbers. If the following 
environment variable: 



JOBNO=ON 

is made part of the user's environment and exported, uucp and 
uux print the job number. Similarly, if the user wishes to turn 
the job numbers off, the environment variable is set as follows: 

JOBNO=OFF 

If you wish to force printing of job numbers without using the 
environment mechanism, use the -j option. For example, 

uucp -j /etc/passwd mhtsb!/dev/null 
UUCP job 282 

forces the job number (282) to be printed. If the -j option is 
not used, the IDs of the jobs (belonging to the user) are found 
by using the uustat(1) command. This provides the job 
number. For example, 

uustat 

0282 tom mhtsb 08/20-21:47 08/20-21:47 JOB IS QUEUED 

0272 tom mhtsb 08/20-21:46 08/20-21:46 JOB IS QUEUED 

shows that the user has two jobs (282 and 272) queued. 

Job Status 

The uustat command allows a user to check on one or all jobs 
that have been queued. The ID printed when a job is queued is 
used as a key to query status of the particular job. An example 
of a request for the status of a given job is: 

uustat -j0711 

0711 tom mhtsb 07/30-02:18 07/30-02:18 JOB IS QUEUED 

There are several status messages that may be printed for a 
given job; the most frequent ones are JOB IS QUEUED and 
JOB COMPLETED (meanings are obvious). The manual page 
for uustat lists the other status messages. 

Network Status 

The status of the last transfer to each system on the network is 
found by using the uustat command. For example, 

uustat -mall 

reports the status of the last transfer to all of the systems 
known to the local system. The output might appear as 



mhbSc 08/10-12:35 CONVERSATION SUCCEEDED 

resear 08/20-17:01 CONVERSATION SUCCEEDED 

minimo 07/22-16:31 DIAL FAILED 

austra 08/20-18:36 WRONG TIME TO CALL 

ucbvax 08/20-20:37 LOGIN FAILED 

where the status indicates the time and state of the last 
transfer to each system. When sending files to a system that 
has not been contacted recently, it is a good idea to use uustat 
to see when the last access occurred (because the remote system 
may be down or out of service). 



Job Control 

With the unique job ID generated for each uucp or uux 
command, it is possible to control jobs in the following ways. 



Job Termination 

A job that consists of transferring many files from several 
different systems can be terminated using the -k option of 
uustat. If any part of the job has left the system, then only 
the remaining parts of the job on the local system are 
terminated. 



Requeuing a Job 

The uucp package clears jobs out of its working area on a regular 
basis (usually every 72 hours) to prevent the buildup of jobs 
that cannot be delivered. The -r option is used to force the 
date of a job to be changed to the current date, thereby 
lengthening the time that uucp attempts to transmit the job. It 
should be noted that the -r option does not impart immortality 
to a job. Rather, it only postpones deleting the job during 
housekeeping functions until the next cleanup. 
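
The interaction between the cleanup and the requeue option can be modeled as follows. This is a sketch under stated assumptions: struct job, job_expired, and job_requeue are hypothetical and do not reflect the actual spool file format, and the 72-hour limit is taken from the text above.

```c
#include <assert.h>
#include <time.h>

#define MAX_AGE_SECONDS (72L * 60L * 60L)   /* the 72-hour limit */

struct job {            /* hypothetical record; not uucp's spool format */
    int    id;
    time_t stamp;       /* date that the cleanup compares against */
};

/* Cleanup test: has the job waited longer than the limit? */
int job_expired(const struct job *j, time_t now)
{
    assert(j != NULL);
    return difftime(now, j->stamp) > (double)MAX_AGE_SECONDS;
}

/* Effect described for requeuing: restamp the job with the current date. */
void job_requeue(struct job *j, time_t now)
{
    assert(j != NULL);
    j->stamp = now;
}
```

Requeuing resets the clock but does not remove the limit, so a job that still cannot be delivered becomes eligible for deletion again once another 72 hours have passed.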



Network Names 



Users may find the names of the systems on the network via 
the uuname(1) command. Only the names of the systems in 
the network are printed. 



UTILITIES THAT USE UUCP 

There are several utilities that rely on uucp or uux(1) to 
transfer files to other systems. The following parts outline the 
more important of these functions. This increases awareness of 
the extent of the use of the network. 



Mail 

The mail(1) command uses uux to forward mail to other 
systems. For example, when a user types: 

mail mhtsa!tom 

the mail command invokes uux to execute rmail on the 
remote system (rmail is a link to the mail command). 
Forwarding mail through several systems (e.g., mail a!b!tom) 
does not use the uucp forwarding feature but is simulated by 
the mail command itself. 



Uuto 

The uuto(l) command uses the uucp facility to send files while 
allowing the local system to control the file access. Suppose 
your login is emsgene and you are on system aaaaa. You have a 
friend (David) on system bbbbb with a login name of wldmc. 
Also assume that both systems are networked to each other 
[See uuname(1)]. To send files using uuto, enter the
following:

uuto filename bbbbb!wldmc

where filename is the name of a file to be sent. The files are 
sent to a public directory defined in the uucp source. In this 
example, David will receive the following mail: 

From nuucp Tue Jan 25 11:09:55 1983 
/usr/spool/uucppublic/receive/wldmc/aaaaa\ 
//filename from aaaaa!emsgene arrived

See uuto(1) for more details.
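
The location named in the mail message above can be sketched as a
simple path construction. The helper name uuto_dest is hypothetical,
and the directory layout is only inferred from the sample message
(receive directory, then recipient login, then sending system); the
actual public directory is defined in the uucp source.

```sh
# Hypothetical helper: build the path where a file sent with uuto
# would be expected to arrive, based on the sample mail message.
uuto_dest() {
    user=$1        # recipient login on the remote system
    from_sys=$2    # name of the sending system
    file=$3        # name of the file that was sent
    echo "/usr/spool/uucppublic/receive/$user/$from_sys/$file"
}

uuto_dest wldmc aaaaa filename
```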

Other Applications 

Some sites have replaced utilities such as lpr(1), opr(1), etc.,
with shell scripts that invoke uux or uucp. Other sites use the 
uucp network as a backup for higher speed networks (e.g., 
PCL, NSC HYPERchannel*, etc.). 



* Trademark of Network Systems Corporation.
SYSTEM SOFTWARE FILE LIST

The following lists show the names of all the UNIX system files 
contained in the Software Distribution Sets. These Sets consist 
of a series of diskettes containing a complete listing of files. 
The software diskettes are shown in alphabetical order by the
name of the software set. 



Diagnostic Diskette 
File Listing 

/s4diagnostic 
/unix 



Floppy Boot Diskette 
File Listing 

/UNIX3.0 
/unix 



Floppy Filesystem Diskette 
File Listing 



/bin 

/bin/cat 

/bin/cp 

/bin/cpio 

/bin/echo 

/bin/ln

/bin/ls

/bin/mkdir 

/bin/mv 

/bin/pwd 

/bin/sh 

/dev 



/dev/console 

/dev/fp000

/dev/fp002 

/dev/fp003 

/dev/fp020 

/dev/fp021 

/dev/kmem 

/dev/lp 

/dev/mem 

/dev/null 

/dev/rawlp 

/dev/rfp000

/dev/rfp002

/dev/rfp003 

/dev/rfp020 

/dev/rfp021 

/dev/swap 

/dev/syscon 

/dev/systty 

/dev/tty 

/dev/tty000

/dev/w1

/dev/w2 

/dev/w3 

/dev/w4 

/dev/window 

/etc 

/etc/dismount 

/etc/group 

/etc/ldrcpy 

/etc/mkfs 

/etc/mnttab 

/etc/mnttab.hd 

/etc/mount 

/etc/passwd 

/etc/profile 

/etc/profile.fd 

/etc/profile.hd 

/etc/reboot 

/etc/umount 

/files2.0 

/findem 

/lib 

/lib/shlib 

/list 

/mnt 

/tmp 

Hard Disk Boot Diskette 
File Listing 

/UNIX3.0 
/unix

Foundation Set 
File Listing 



/bin 

/bin/basename 

/bin/cat 

/bin/chgrp 

/bin/chmod 

/bin/chown 

/bin/clear 

/bin/cmp 

/bin/cp 

/bin/cpio 

/bin/date 

/bin/dd 

/bin/df 

/bin/diff 

/bin/dirname 

/bin/du 

/bin/echo 

/bin/ed 

/bin/env 

/bin/expr 

/bin/false 

/bin/file 

/bin/find 

/bin/grep 

/bin/head 

/bin/kill 

/bin/ld

/bin/line

/bin/ln

/bin/login

/bin/ls

/bin/mail 

/bin/mc68k 

/bin/mesg 



/bin/mkdir 

/bin/mid 

/bin/mv 

/bin/newgrp 

/bin/nice 

/bin/nohup 

/bin/od 

/bin/passwd 

/bin/pdp11

/bin/pr 

/bin/ps 

/bin/pwd 

/bin/red 

/bin/rm 

/bin/rmail 

/bin/rmdir 

/bin/rsh 

/bin/scrset 

/bin/sed 

/bin/sh 

/bin/size 

/bin/sleep 

/bin/sort 

/bin/stty 

/bin/su 

/bin/sum 

/bin/sync 

/bin/tail 

/bin/tee 

/bin/telinit 

/bin/time 

/bin/touch 

/bin/true 

/bin/tty 
/bin/u370

/bin/u3b 

/bin/uname 

/bin/vax 

/bin/wc 

/bin/who 

/bin/write

/dev 

/dev/console 

/dev/error 

/dev/fp000

/dev/fp002 

/dev/fp003 

/dev/fp020 

/dev/fp021 

/dev/kmem 

/dev/mem 

/dev/null 

/dev/lp 

/dev/ph0

/dev/ph1

/dev/rawlp 

/dev/rfp000

/dev/rfp001

/dev/rfp002 

/dev/rfp003 

/dev/rfp020 

/dev/rfp021 

/dev/swap 

/dev/syscon 

/dev/systty 

/dev/tty 

/dev/tty000

/dev/window

/dev/w1

/dev/w2 

/dev/w3 

/dev/w4 

/dev/w5 

/dev/w6 



/dev/w7 

/dev/w8 

/dev/w9 

/dev/w10

/dev/w11

/dev/w12

/etc 

/etc/.cleanup 

/etc/.drvload 

/etc/.extra

/etc/.lineone 

/etc/.linetwo 

/etc/.lpstartsched 

/etc/.rs232 

/etc/.firstrc 

/etc/.version 

/etc/TZ 

/etc/checklist 

/etc/cleanup.wk 

/etc/convert 

/etc/convert/CONVERSIONS 

/etc/convert/convert 

/etc/convert/copyback 

/etc/convert/formconvert 

/etc/convert/rcconvert 

/etc/convert/uaconvert 

/etc/devnm 

/etc/dismount 

/etc/fsck 

/etc/getty 

/etc/gettydefs 

/etc/group 

/etc/init 

/etc/inittab 

/etc/ioctl.syscon

/etc/iv 

/etc/killall 

/etc/lddrv 

/etc/lddrv/InstDrv 

/etc/lddrv/drivers 
/etc/lddrv/lddrv

/etc/lddrv/lipc.o 

/etc/lddrv/mkifile 

/etc/lddrv/unix.sym 

/etc/magic 

/etc/master 

/etc/masterupd 

/etc/mkfs 

/etc/mknod 

/etc/mnttab 

/etc/motd 

/etc/mount 

/etc/mountable 

/etc/namesys 

/etc/passwd 

/etc/ph 

/etc/printers 

/etc/profile 

/etc/pwcntl 

/etc/rc 

/etc/reboot 

/etc/setmnt 

/etc/shutdown 

/etc/smgr 

/etc/termcap 

/etc/umount 

/etc/umountable 

/etc/update 

/etc/wall 

/etc/wmgr 

/lib 

/lib/shlib 

/mnt 

/tmp 

/u 

/u/install 

/u/install/.phdir 

/u/install/.profile 

/u/install/Environment 

/u/install/Filecabinet 



/u/install/Filecabinet/Profiles
/u/install/Filecabinet/Profiles/1200bps:Am
/u/install/Filecabinet/Profiles/300bps:Am
/u/install/Filecabinet/Profiles/9600bps:A2
/u/install/Administration
/u/install/Software
/u/tutor
/u/tutor/.phdir
/u/tutor/.profile
/u/tutor/Environment
/u/tutor/Filecabinet
/u/tutor/Filecabinet/Profiles
/u/tutor/Filecabinet/Profiles/1200bps:Am
/u/tutor/Filecabinet/Profiles/300bps:Am
/u/tutor/Filecabinet/Profiles/9600bps:A2A
/u/tutor/Filecabinet/practice
/u/tutor/Filecabinet/practice/example.hlp
/u/tutor/Filecabinet/practice/windows.hlp
/.profile 
/UNIX3.0 
/unix 
/usr 

/usr/adm 
/usr/adm/cronlog 
/usr/bin 
/usr/bin/.!, 
/usr/bin/Backup.sh 
/usr/bin/Diagnos.sh
/usr/bin/Fcopy.sh
/usr/bin/Fformat.sh

/usr/bin/FlpyChk.sh 

/usr/bin/Install.sh 

/usr/bin/Instcpio.sh 

/usr/bin/Ldriver 

/usr/bin/Lsys.sh 

/usr/bin/MsdosF.sh 

/usr/bin/MsdosR.sh 

/usr/bin/MsdosW.sh 

/usr/bin/Namesys.sh

/usr/bin/Pclear.sh 

/usr/bin/Phones.sh 

/usr/bin/RS232.sh 

/usr/bin/RSfree.sh 

/usr/bin/Restore.sh 

/usr/bin/Showsoft.sh

/usr/bin/Ulogin 

/usr/bin/Uninstall.sh 

/usr/bin/Users.sh 

/usr/bin/getoff.sh 

/usr/bin/geton.sh 

/usr/bin/asa 

/usr/bin/async_main 

/usr/bin/awk 

/usr/bin/banner 

/usr/bin/bc 

/usr/bin/cancel 

/usr/bin/comb 

/usr/bin/comm 

/usr/bin/crypt 

/usr/bin/csplit 

/usr/bin/cu 

/usr/bin/cut 

/usr/bin/dc 

/usr/bin/disable 

/usr/bin/enable 

/usr/bin/erricon 

/usr/bin/fc 

/usr/bin/fdfmt.nl 

/usr/bin/fdfmt.sl 



/usr/bin/fdfmt.vl 

/usr/bin/fgrep 

/usr/bin/findem 

/usr/bin/getopt 

/usr/bin/getterm 

/usr/bin/id 

/usr/bin/info 

/usr/bin/join 

/usr/bin/lp 

/usr/bin/lpinfo 

/usr/bin/lpstat 

/usr/bin/lpsetup 

/usr/bin/message 

/usr/bin/more 

/usr/bin/msdos 

/usr/bin/md_write 

/usr/bin/md_format 

/usr/bin/newwind 

/usr/bin/nl 

/usr/bin/page 

/usr/bin/password 

/usr/bin/paste 

/usr/bin/path 

/usr/bin/phconvert 

/usr/bin/phcreate 

/usr/bin/phnum 

/usr/bin/phpref 

/usr/bin/phstub 

/usr/bin/pwait 

/usr/bin/pwdmenu 

/usr/bin/setdate 

/usr/bin/setgetty 

/usr/bin/setuname 

/usr/bin/shform 

/usr/bin/split 

/usr/bin/spr 

/usr/bin/sprint 

/usr/bin/tr 

/usr/bin/tutor 

/usr/bin/ua 
/usr/bin/uahelp

/usr/bin/uaupd 

/usr/bin/umodem 

/usr/bin/uniq 

/usr/bin/uucp 

/usr/bin/uucppwd 

/usr/bin/uulog 

/usr/bin/uuname 

/usr/bin/uupick 

/usr/bin/uustat 

/usr/bin/uuto 

/usr/bin/uux 

/usr/installed 

/usr/installed/.list 

/usr/lib 

/usr/lib/accept 

/usr/lib/crontab 

/usr/lib/diffh 

/usr/lib/iv 

/usr/lib/iv/FDnl 

/usr/lib/iv/FDsl 

/usr/lib/iv/FDvl 

/usr/lib/iv/atasi40 

/usr/lib/iv/atasi50 

/usr/lib/iv/hitachi50 

/usr/lib/iv/loader 

/usr/lib/iv/maxtor40 

/usr/lib/iv/miniscribe10-3

/usr/lib/iv/miniscribe20-4 

/usr/lib/iv/rodime40 

/usr/lib/iv/s4load.silent

/usr/lib/iv/s4load.verbose

/usr/lib/lib.b 

/usr/lib/lpadmin 

/usr/lib/lpmove 

/usr/lib/lpqueue 

/usr/lib/lpsched 

/usr/lib/lpshut 

/usr/lib/makekey 

/usr/lib/more.help 



/usr/lib/ua 

/usr/lib/ua/1200bps:Am 

/usr/lib/ua/300bps:Am 

/usr/lib/ua/9600bps:A2 

/usr/lib/ua/Administration

/usr/lib/ua/Backuser.menu 

/usr/lib/ua/Environment 

/usr/lib/ua/Floppy 

/usr/lib/ua/Hardware 

/usr/lib/ua/Installn.form 

/usr/lib/ua/Login.form 

/usr/lib/ua/Lsys.form 

/usr/lib/ua/Lsys2.form 

/usr/lib/ua/Lsys2s.form 

/usr/lib/ua/Mail 

/usr/lib/ua/Mailph.form 

/usr/lib/ua/Namesys.form 

/usr/lib/ua/Office 

/usr/lib/ua/Others 

/usr/lib/ua/Phones.form 

/usr/lib/ua/Preferences 

/usr/lib/ua/Printers 

/usr/lib/ua/RS232a.form 

/usr/lib/ua/RS232b.form

/usr/lib/ua/RS232c.form 

/usr/lib/ua/RS232d.form 

/usr/lib/ua/RS232e.form 

/usr/lib/ua/Restore.form 

/usr/lib/ua/Showsoft.menu 

/usr/lib/ua/Software 

/usr/lib/ua/Suffixes 

/usr/lib/ua/Uninstall.menu 

/usr/lib/ua/User.form 

/usr/lib/ua/admin.hlp 

/usr/lib/ua/keymap 

/usr/lib/ua/keynames 

/usr/lib/ua/kmap.5410 

/usr/lib/ua/kmap.5420 

/usr/lib/ua/kmap.5425 

/usr/lib/ua/kmap.b513 



/usr/lib/ua/kmap.hp
/usr/lib/ua/kmap.tvi925
/usr/lib/ua/kmap.vt100
/usr/lib/ua/phnum
/usr/lib/ua/phone.hlp
/usr/lib/ua/ua.hlp
/usr/lib/ua/uasetx
/usr/lib/ua/uasig
/usr/lib/uucp
/usr/lib/uucp/.OLD
/usr/lib/uucp/.XQTDIR
/usr/lib/uucp/L-devices
/usr/lib/uucp/L-dialcodes
/usr/lib/uucp/L.cmds
/usr/lib/uucp/L.sys
/usr/lib/uucp/L_stat
/usr/lib/uucp/L_sub
/usr/lib/uucp/R_stat
/usr/lib/uucp/R_sub
/usr/lib/uucp/USERFILE
/usr/lib/uucp/modemcap
/usr/lib/uucp/uucico
/usr/lib/uucp/uuclean
/usr/lib/uucp/uudemon.day
/usr/lib/uucp/uudemon.hr
/usr/lib/uucp/uudemon.wk
/usr/lib/uucp/uusub
/usr/lib/uucp/uuxqt
/usr/lib/wfont
/usr/lib/wfont/BLD.ft
/usr/lib/wfont/ELD.ft
/usr/lib/wfont/PLAIN.I.E.12.A
/usr/lib/wfont/ROMC.ft
/usr/lib/wfont/ROMG.ft
/usr/lib/wfont/SCLD.ft
/usr/lib/wfont/UK.ft
/usr/lib/wfont/VBM.ft
/usr/lib/wfont/monitor.8.ft
/usr/lib/wfont/mosaic.8.ft
/usr/lib/wfont/special.8.ft
/usr/lib/wfont/system.8.ft
/usr/lib/wfont/system.r.8.ft
/usr/mail
/usr/practice
/usr/practice/practice.hlp
/usr/practice/practice.hlp/example.hlp
/usr/practice/practice.hlp/windows.hlp
/usr/practice/tutor.err
/usr/practice/tutor.err2
/usr/practice/tutor.msg
/usr/practice/tutor.rst
/usr/spool
/usr/spool/lp
/usr/spool/lp/class
/usr/spool/lp/interface
/usr/spool/lp/member
/usr/spool/lp/model
/usr/spool/lp/model/dumb
/usr/spool/lp/model/dumb_S
/usr/spool/lp/model/dumb-remote
/usr/spool/lp/pstatus
/usr/spool/lp/qstatus
/usr/spool/lp/request
/usr/spool/uucp
/usr/spool/uucppublic
/usr/tmp



Development Set 
File Listing 



/bin/adb 

/bin/ar 

/bin/as 

/bin/cc 

/bin/dump 

/bin/ksh 

/bin/lorder 

/bin/make 

/bin/mas 

/bin/mcc 

/bin/nm 

/bin/sdb 

/bin/strip 

/bin/tset 

/etc/bcopy 

/etc/chroot 

/etc/clri 

/etc/cron 

/etc/fsdb 

/etc/ncheck 

/etc/whodo 

/lib/ccom 

/lib/crt0.o

/lib/crt0s.o

/lib/ifile.0407 

/lib/ifile.0410 

/lib/ifile.0413 

/lib/shlib.ifile 

/lib/libc.a 

/lib/libg.a 

/lib/libm.a 

/lib/libPW.a 

/lib/libp 

/lib/libp/libc.a

/lib/mccom 

/lib/mcpp 



/lib/cpp 

/lib/mcrt0.o

/lib/moptim 

/lib/optim 

/usr/bin/admin 

/usr/bin/bdiff 

/usr/bin/cal 

/usr/bin/cb 

/usr/bin/cdc 

/usr/bin/cflow 

/usr/bin/cfont 

/usr/bin/cmpdt 

/usr/bin/cxref 

/usr/bin/delta 

/usr/bin/diff3 

/usr/bin/dircmp 

/usr/bin/egrep 

/usr/bin/factor 

/usr/bin/get 

/usr/bin/help 

/usr/bin/ipcrm 

/usr/bin/ipcs 

/usr/bin/lex 

/usr/bin/lint 

/usr/bin/logname 

/usr/bin/m4 

/usr/bin/pack 

/usr/bin/pcat 

/usr/bin/prof 

/usr/bin/prs 

/usr/bin/regcmp 

/usr/bin/rmchg 

/usr/bin/rmdel 

/usr/bin/sact 

/usr/bin/sccsdiff 

/usr/bin/sdiff 
/usr/bin/tar

/usr/bin/tsort 

/usr/bin/unget 

/usr/bin/units 

/usr/bin/unpack 

/usr/bin/val 

/usr/bin/vc 

/usr/bin/what 

/usr/bin/xargs 

/usr/bin/yacc 

/usr/include 

/usr/include/a.out.h 

/usr/include/alarm.h 

/usr/include/aouthdr.h 

/usr/include/ar.h 

/usr/include/assert.h 

/usr/include/core.h 

/usr/include/ctype.h 

/usr/include/curses.h 

/usr/include/dial.h 

/usr/include/dumprestor.h
/usr/include/errno.h 
/usr/include/exch.h 
/usr/include/execargs.h 
/usr/include/fatal.h 
/usr/include/fcntl.h 
/usr/include/filehdr.h 
/usr/include/form.h 
/usr/include/ftw.h 
/usr/include/gdioctl.h 
/usr/include/grp.h 
/usr/include/kcodes.h
/usr/include/ldfcn.h 
/usr/include/linenum.h 
/usr/include/lp.h 
/usr/include/macros.h 
/usr/include/Makepre.h 
/usr/include/Makepost.h 
/usr/include/math.h 



/usr/include/memory.h 

/usr/include/menu.h 

/usr/include/message.h 

/usr/include/mnttab.h 

/usr/include/mon.h 

/usr/include/nan.h 

/usr/include/pbf.h 

/usr/include/pwd.h 

/usr/include/regexp.h 

/usr/include/reloc.h 

/usr/include/rje.h 

/usr/include/scnhdr.h 

/usr/include/search.h

/usr/include/setjmp.h

/usr/include/sgs.h 

/usr/include/sgtty.h 

/usr/include/signal.h 

/usr/include/stand.h 

/usr/include/status.h 

/usr/include/stdio.h

/usr/include/storclass.h 

/usr/include/string.h 

/usr/include/symbol.h 

/usr/include/syms.h 

/usr/include/sys 

/usr/include/sys/acct.h 

/usr/include/sys/buf.h 

/usr/include/sys/callo.h 

/usr/include/sys/cmap.h 

/usr/include/sys/conf.h 

/usr/include/sys/dialer.h 

/usr/include/sys/dir.h 

/usr/include/sys/dmap.h 

/usr/include/sys/drv.h 

/usr/include/sys/err.h 

/usr/include/sys/errno.h 

/usr/include/sys/fblk.h 

/usr/include/sys/file.h 

/usr/include/sys/filsys.h 

/usr/include/sys/font.h 
/usr/include/sys/gdioctl.h
/usr/include/sys/gdisk.h
/usr/include/sys/hardware.h
/usr/include/sys/hardware.m
/usr/include/sys/i8274.h
/usr/include/sys/init.h
/usr/include/sys/ino.h
/usr/include/sys/inode.h
/usr/include/sys/iobuf.h
/usr/include/sys/ioctl.h
/usr/include/sys/iohw.h
/usr/include/sys/iohw.m
/usr/include/sys/ipc.h
/usr/include/sys/kbd.h
/usr/include/sys/lapbtr.h
/usr/include/sys/lock.h
/usr/include/sys/lprio.h
/usr/include/sys/map.h
/usr/include/sys/modem.h
/usr/include/sys/mount.h
/usr/include/sys/mouse.h
/usr/include/sys/msg.h
/usr/include/sys/param.h
/usr/include/sys/ph.h
/usr/include/sys/phone.h
/usr/include/sys/proc.h
/usr/include/sys/pte.h
/usr/include/sys/reg.h
/usr/include/sys/rtic.h
/usr/include/sys/sem.h
/usr/include/sys/shm.h
/usr/include/sys/signal.h
/usr/include/sys/slot.h
/usr/include/sys/space.h
/usr/include/sys/spl.h
/usr/include/sys/st.h
/usr/include/sys/stat.h
/usr/include/sys/stermio.h
/usr/include/sys/sysinfo.h
/usr/include/sys/syslocal.h



/usr/include/sys/sysmacros.h 

/usr/include/sys/systm.h 

/usr/include/sys/target.h 

/usr/include/sys/termio.h 

/usr/include/sys/text.h 

/usr/include/sys/times.h 

/usr/include/sys/trap.h 

/usr/include/sys/ttold.h 

/usr/include/sys/tty.h 

/usr/include/sys/tune.h 

/usr/include/sys/types.h 

/usr/include/sys/user.h 

/usr/include/sys/utsname.h 

/usr/include/sys/vadvise.h 

/usr/include/sys/var.h 

/usr/include/sys/vlimit.h 

/usr/include/sys/vm.h 

/usr/include/sys/vmmac.h 

/usr/include/sys/vmmeter.h 

/usr/include/sys/vmparam.h 

/usr/include/sys/vmsystm.h 

/usr/include/sys/vtimes.h 

/usr/include/sys/wait.h 

/usr/include/sys/wd.h 

/usr/include/sys/window.h 

/usr/include/syslocal.h 

/usr/include/tam.h 

/usr/include/termio.h 

/usr/include/time.h 

/usr/include/tp_defs.h 

/usr/include/track.h 

/usr/include/ustat.h 

/usr/include/utmp.h 

/usr/include/values.h 

/usr/include/varargs.h 

/usr/include/wind.h 

/usr/lib/dag 

/usr/lib/diff3prog 

/usr/lib/flip 

/usr/lib/help 
/usr/lib/help/ad

/usr/lib/help/bd 

/usr/lib/help/cb 

/usr/lib/help/cm 

/usr/lib/help/cmds 

/usr/lib/help/co 

/usr/lib/help/de 

/usr/lib/help/default 

/usr/lib/help/ge 

/usr/lib/help/he 

/usr/lib/help/prs 

/usr/lib/help/rc 

/usr/lib/help/un 

/usr/lib/help/ut 

/usr/lib/help/vc 

/usr/lib/lex 

/usr/lib/lex/ncform 

/usr/lib/lex/nrform 

/usr/lib/lib300.a 

/usr/lib/lib300s.a

/usr/lib/lib4014.a 

/usr/lib/lib450.a 

/usr/lib/libcurses.a 

/usr/lib/libdev.a 

/usr/lib/libl.a 

/usr/lib/libld.a 

/usr/lib/libmath.a 

/usr/lib/libplot.a 

/usr/lib/libtam.a 

/usr/lib/libtermcap.a 

/usr/lib/libtermlib.a 

/usr/lib/libvt0.a

/usr/lib/liby.a 

/usr/lib/lint1

/usr/lib/lint2 

/usr/lib/llib-lc 

/usr/lib/llib-lc.ln

/usr/lib/llib-port 

/usr/lib/llib-port.ln 

/usr/lib/llib-lm 



/usr/lib/llib-lm.ln 

/usr/lib/lpfx 

/usr/lib/nmf 

/usr/lib/reject 

/usr/lib/ua/DEVSuffixes 

/usr/lib/ua/tam.a 

/usr/lib/unittab 

/usr/lib/xcpp 

/usr/lib/xpass 

/usr/lib/yaccpar 

/usr/preserve 
/usr/bin/300

/usr/bin/300s 

/usr/bin/4014 

/usr/bin/450 

/usr/bin/checkcw 

/usr/bin/checkeq 

/usr/bin/checkmm 

/usr/bin/col 

/usr/bin/cw 

/usr/bin/deroff 

/usr/bin/diffmk 

/usr/bin/eqn 

/usr/bin/greek 

/usr/bin/hp 

/usr/bin/hyphen 

/usr/bin/mm 

/usr/bin/mmt 

/usr/bin/mvt 

/usr/bin/neqn 

/usr/bin/newform 

/usr/bin/nroff 

/usr/bin/osdd 

/usr/bin/ptx 

/usr/bin/spell 

/usr/bin/tabs 

/usr/bin/tbl 

/usr/bin/tc 

/usr/lib/eign 

/usr/lib/help/term 

/usr/lib/help/text 

/usr/lib/macros 

/usr/lib/macros/an 

/usr/lib/macros/cmp.n.d.an 

/usr/lib/macros/cmp.n.d.m 

/usr/lib/macros/cmp.n.t.an 

/usr/lib/macros/cmp.n.t.m 



/usr/lib/macros/mmn 

/usr/lib/macros/osdd 

/usr/lib/macros/ptx 

/usr/lib/macros/ucmp.n.an 

/usr/lib/macros/ucmp.n.m 

/usr/lib/macros/vmca

/usr/lib/spell 

/usr/lib/spell/compress 

/usr/lib/spell/hashcheck 

/usr/lib/spell/hashmake 

/usr/lib/spell/hlista 

/usr/lib/spell/hlistb 

/usr/lib/spell/spellin 

/usr/lib/spell/spellprog 

/usr/lib/spell/hstop 

/usr/lib/spell/spellhist 

/usr/lib/suftab 

/usr/lib/tabset 

/usr/lib/tabset/3101 

/usr/lib/tabset/beehive 

/usr/lib/tabset/diablo 

/usr/lib/tabset/std 

/usr/lib/tabset/teleray 

/usr/lib/tabset/tvi925 

/usr/lib/tabset/vt100

/usr/lib/tabset/xerox1720

/usr/lib/term 

/usr/lib/term/tab2631 

/usr/lib/term/tab2631-c 

/usr/lib/term/tab2631-e 

/usr/lib/term/tab300 

/usr/lib/term/tab300-12 

/usr/lib/term/tab300S 

/usr/lib/term/tab300S-12 

/usr/lib/term/tab300s 

/usr/lib/term/tab300s-12 
/usr/lib/term/tab37

/usr/lib/term/tab382 

/usr/lib/term/tab4000A 

/usr/lib/term/tab4000a 

/usr/lib/term/tab450 

/usr/lib/term/tab450-12 

/usr/lib/term/tab832 

/usr/lib/term/tabX 

/usr/lib/term/tabal 

/usr/lib/term/tablp 

/usr/lib/term/tabtn300 

/usr/lib/tmac 

/usr/lib/tmac/tmac.an 

/usr/lib/tmac/tmac.m 

/usr/lib/tmac/tmac.org 

/usr/lib/tmac/tmac.osd 

/usr/lib/tmac/tmac.ptx 

/usr/lib/tmac/tmac.v 

/usr/pub 

/usr/pub/eqnchar 

Enhanced Editor Set 
File Listing 

/usr/bin/bfs 

/usr/bin/edit 

/usr/bin/ex 

/usr/bin/vi 

/usr/bin/view 

/usr/lib/ex3.7preserve 

/usr/lib/ex3.7recover 

/usr/lib/ex3.7strings 






Important Information for Users of 
the UNIX PC UNIX Programmer's Guide 



This update package contains additional information for 
use with the UNIX Programmer's Guide. Please review 
this information and keep it with your Programmer's
Guide.



Chapter 5, Compiler and C Language - Programs that use
a symbol name longer than eight characters are
supported in Version 3.5. When linking the new
flexname code with the preflexname code,
symbol-referencing errors may be generated by the loader for
the long symbol names. You can resolve this problem by
doing one of the following:

1 Use the -T option with cc to cause
truncation of the long symbols.

2 Use the -G option with ld to allow linking of
older libraries to flexname code.
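
To see which names would be affected, the sketch below filters a list
of symbol names (one per line) and shows each name longer than eight
characters next to its eight-character truncation. The helper name
long_symbols and the sample names are hypothetical, and the sketch
assumes that preflexname code keeps only the first eight characters.

```sh
# Hypothetical helper: read symbol names on standard input and
# report those longer than eight characters with their truncation.
long_symbols() {
    awk 'length($0) > 8 { print $0, "->", substr($0, 1, 8) }'
}

printf '%s\n' main read_configuration buf | long_symbols
```

Two long names that share the same first eight characters would
collide after truncation, so such a list is worth checking before
choosing the -T option.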

The 3.0 Archive Interface Disk (Disk 12 of the UNIX 
Utilities) contains utilities to interact with 
preflexname archives. 



